THE subject of my address is “Canada, Northern Neighbor.” Since
there has been an intensification of interest in Canada on
the part of your countrymen, I came to the conclusion that the subject 
I have chosen might have a more general appeal than any other I 
might select. 

Canadians in recent years have been exercising their critical faculty 
on the broader aspects of national identity and characteristics. Some 
have wondered if we really have developed a national consciousness, 
others have described us as “living under the tyranny of the Sunday 
Suit,” as conservative, self-conscious, lacking self-confidence. One, per- 
haps with tongue in cheek, for he has been termed Canada’s literary 
gadfly, declares we possess traits of infantilism in our national life. We 
took heart, however, from a later critic who advanced us to the stage 
of adolescence. If Canada has reached her present status while still no 
more than adolescent, surely we may have hopes for a satisfactory 
maturity! 

Of late there have been great happenings in Canada; spectacular dis- 
coveries in natural resources and industrial developments with impor- 
tant implications for both the United States and Canada which could 
not fail to draw attention to the affairs of the Northern Neighbor. 

Canada possesses more square miles of territory than any other 
nation save the Union of Soviet Socialist Republics and China. Its
area is 3,846,000 square miles but a mere statement of superficial area 
can be misleading. The vast open spaces of Canada have not seldom 
attracted the envious gaze of nations whose population requires more 
space. In Canada as a whole there are fewer than four persons per 
square mile of territory which compares with over fifty in the U.S.A. 
The Yukon and Northwest Territories, not yet organized into provinces,
which comprise over 1½ million square miles or 39% of the surface
of Canada, had only 25,000 population in 1951. It is the nature
of these sparsely settled areas and the northern parts of the Canadian 
provinces which accounts for the low population density of the Do- 
minion as a whole. 
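
The density figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes a national population of about 15 million (the figure given later in this address) and takes the area of the two Territories as about 1.5 million square miles. The constant and function names are hypothetical.

```python
# Rough check of the density figures quoted in the address.
# Assumed inputs: a population of about 15 million (stated later in the
# address) and a territorial area of about 1.5 million square miles.
CANADA_AREA_SQ_MI = 3_846_000
CANADA_POPULATION = 15_000_000      # assumption: "over 15,000,000"
TERRITORIES_AREA_SQ_MI = 1_500_000  # assumption: Yukon plus Northwest Territories
TERRITORIES_POPULATION = 25_000     # 1951 figure from the address


def density(population: int, area_sq_mi: int) -> float:
    """Return persons per square mile."""
    return population / area_sq_mi


national_density = density(CANADA_POPULATION, CANADA_AREA_SQ_MI)
territorial_density = density(TERRITORIES_POPULATION, TERRITORIES_AREA_SQ_MI)
territorial_share = TERRITORIES_AREA_SQ_MI / CANADA_AREA_SQ_MI

print(f"Canada: {national_density:.1f} persons per square mile")
print(f"Territories: {territorial_density:.3f} persons per square mile")
print(f"Territories' share of area: {territorial_share:.0%}")
```

Under these assumptions the national figure comes out just under four persons per square mile, and the Territories account for about 39% of the area, in line with the text.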

An outstanding physical feature of this country is the Canadian 
Shield which has had and will continue to have a profound influence 
on our development. It is a huge U-shaped area of Precambrian rock, 
interspersed with sedimentary and volcanic intrusions, comprising 
1,800,000 square miles or 45% of our total territory. It surrounds the 
great inland sea known as Hudson Bay and extends from the Coast of 
Labrador on the Atlantic, to the Interior Plains of the Prairie Prov- 
inces, and at one point stretches down to the Thousand Islands in the 
St. Lawrence River. At its top it is 1,900 miles across. It covers most of 
Quebec, a very large though somewhat smaller portion of Ontario, 
three-fifths of Manitoba, one-third of Saskatchewan, the northeast 
corner of Alberta, approximately one-half of the Northwest Territories 
and thrusts up in places into the Arctic Islands. 

The Canadian Shield is not suited for agriculture except in scattered 
pockets. If it had been, our population would be much larger than it is.
It is a treasure trove of mineral wealth; its rushing rivers are a source 
of electric power; it has vast forested areas, the source of pulp and paper 
and lumbering industries, and contains many fur-bearing animals; the 
natural beauty of its scenery, its lakes and rivers, many of which 
abound in fish, are a tourist attraction with great drawing power. 

If the Canadian Shield is 45% of our territory, what of the other 55%? 

Time does not permit of a detailed description of the other regions. 
In broad outline it may be said that the south-eastern part of the 
province of Quebec, the three Maritime provinces, Nova Scotia, New 
Brunswick, and Prince Edward Island, and Newfoundland are in the 
Appalachian region. Therefore that area is cut up by mountain 
ranges which limit the amount of arable land. There is no coastal plain 
but there are the submerged Grand Banks, one of the great fishing areas
of the world. 

In the central provinces, Quebec and Ontario, the region of heavy 
settlement is narrow because the Canadian Shield reaches far to the 
south. The total area of these two provinces is over one million square 
miles. A section to the south along the lower Great Lakes and the 
Upper St. Lawrence, including the Niagara Peninsula, comprising 
about 50,000 square miles, is known as the St. Lawrence Lowlands. It 
contains between one-half and two-thirds of the population of Canada 
and is the industrial heart of the Dominion. 

From the western limit of this section, that is, from Lake Huron
westward, there is a stretch of a thousand miles before the continental
plain is reached. That thousand miles is occupied by the southward 
thrust of the Canadian Shield and has been a great barrier to east-west 
communication. Even to-day Canadians motoring, say, from Montreal 
to Vancouver, avoid this stretch by going in transit through the United 
States. Work is progressing on a Trans-Canada highway. In the course 
of a rail journey through this region the passenger will look out upon 
hundreds of miles of wilderness with only occasional centres of settle- 
ment. 

The Continental Plain occupies the greatest portion of the prairie 
provinces, that is, the area not taken up by the Canadian Shield. It has 
deep arable soil from which the bulk of Canada’s grain is produced. 
But these provinces, especially Alberta, have come into the limelight 
recently because of oil and gas discoveries. 

In the extreme West is the Cordilleran Region. Here high mountain 
ranges are interspersed with valleys where mixed agriculture and fruit 
growing are important. Other basic industries are mining, lumbering, 
and fishing. 

North of the Prairie Provinces and stretching to the Arctic Ocean 
lie the Yukon and Northwest Territories. This huge area presents a 
variety of physical conditions; treeless plains in the far north; the roll- 
ing hills of the Canadian Shield; the forested valley of the Mackenzie 
river. Water resources vary from small streams and lakes to the largest 
rivers in Canada, the Mackenzie and Yukon, both over 2,000 miles in 
length, and Great Slave and Great Bear Lakes, each of which is 11,000 
square miles. 

Canada obviously is a land with a difficult topography. There is no 
possibility of continuous settlement throughout all its regions. At 
present Canadians are concentrated in a line some 4,000 miles long 
with thinly populated sections intervening and the great bulk live 
within 200 or 300 miles of the border. 

Canada also has difficult climatical conditions characterized by hot 
summers and cold winters with heavy snowfall. The climate of south- 
ern British Columbia is an exception resembling to some extent that 
of England. Northern Canada, of course, has Arctic conditions and 
no sections of Canada have a climate corresponding to that of southern 
Europe. 

We have survived and developed as a nation, not only in spite of 
topography, geography and climate, but in spite of other influences 
which might have caused another story to be written. In 1759, when
Wolfe won his victory on the Plains of Abraham, this upper part of 
North America passed into the hands of Britain. The policy adopted 
towards the French inhabitants shortly after was important for our 
future. They were granted full freedom of worship, speech, and customs.
This was probably an important reason why they did not favor
joining the newly formed United States after 1776 despite the fact that 
the colonies of British North America were small, isolated and rela- 
tively weak. This decision to remain in the British orbit created a 
refuge for those in the new Republic who had opposed the Revolution, 
and we had an influx of what have been called United Empire Loyalists 
who have contributed much to the development of Canada. Thus, in 
the very early stage of our history, our big neighbor has strongly influ- 
enced our development. Naturally the United States has continued to 
influence us in many ways. Most of this influence has been good and 
benign but it has not been invariably so. There were the border skir- 
mishes of 1812 and raids at later dates. There was the threat of your 
westward surge and the doctrine of Manifest Destiny. But these events 
spurred Canadians to accomplish what to some seemed impossible in 
order to preserve the nation. The first was the achievement of Con- 
federation by which, in the face of much opposition, four provinces 
entered the federation. These were Nova Scotia, New Brunswick, 
Quebec and Ontario, the last two having been Lower and Upper Canada
previously. This was in 1867, and in 1873 Prince Edward Island joined.
Away on the Pacific Coast was another colony—British Columbia. 
Between the settled parts of Ontario and British Columbia stretched 
a practically empty country for two thousand miles, including what 
was to become the great grain growing provinces of the Prairies. British 
Columbia’s condition for entering Confederation was the building of a 
railway across these empty spaces. 

Through miles of Precambrian rock, through regions that seemed to 
be more water than land, across the prairies and through the three 
ranges of the Rocky Mountains the railroad was pushed and the first 
train reached Vancouver in 1886. Thus Canada became bound together
by a thin line of steel. It was many years afterwards that the great open 
spaces of the Prairies really began to fill up with people instead of 
herds of buffalo. 

The survival of Canada as a nation, in spite of its handicaps in 
topography, geography, and climate and in spite of the siren allure- 
ments of your prosperity, your big markets, and sometimes your pres- 
sures, is one reason why historians declaim on the “Miracle of Canada.” 
There are others. To create a separate nation we had to force our 
economy to a large extent into an east-west development, whereas, 
according to nature in some sections at least, it should be north-south. 
This is particularly true of the extreme western and eastern provinces.

Canadian destiny, of course, has been influenced by two great pow- 
ers, Great Britain as well as the United States. Canada, like the United 
States, has a federal organization but, apart from the federal aspect,
Canadian constitutional practice is modelled closely on the British 
Parliamentary system. 

Ours is the Parliamentary Cabinet System of government. The Cabi- 
net, headed by the Prime Minister, brings in most of the legislation and 
can, except in very unusual circumstances, count on the support of the 
members of the party. We have a Senate in which members are ap- 
pointed by the Government for life. It is very different from the U.S. 
Senate. In fact, it plays a minor role in the affairs of the nation. 

These and other basic institutions including our judicial system we 
have inherited from the United Kingdom. It is our belief that they 
give soundness and stability to our country, which make it attractive 
and safe for other countries to invest in. There are no obstacles to bring- 
ing capital in or taking it out. 

Canada, a member of the British Commonwealth of Nations, has 
equality of status with the United Kingdom in all her domestic and 
foreign affairs. Canada is a sovereign autonomous state. Some of my 
acquaintances in the United States have been surprised and almost 
skeptical when I told them we pay no taxes to the United Kingdom. 
Our full-hearted allegiance to England’s Queen as Queen of Canada 
carries with it no recognition of England’s government as our govern- 
ment. The concept of a Queen of the Commonwealth seems to many to 
connote some financial obligation. Doubtless the fact that Queen 
Elizabeth is Queen of Canada, and Queen of each Commonwealth Na- 
tion, as well as Queen of the United Kingdom, is something of a puzzle 
to those not directly concerned with the evolution of this political 
achievement. It has even been suggested that the Queen might live part 
of the time in the Capital of each of the Commonwealth nations. 

In this connection, I quote from Professor Lower’s book “Canada: 
Nation and Neighbor”— 


The British Empire of the last century, which was both empire of dom- 
ination and empire of settlement, began to change its nature as Canada 
grew to maturity. The history of Canada’s relation with Great Britain is 
not only the history of self-government of a colony of settlement but it is 
also the history of the reshaping of imperial institutions to reconcile them 
with self-government. Canada was determined to have the fullest measure 
of self-government; she was also determined to have it without disrupting 
her association with the mother-country. The result was the change, so 
frequently described “from Empire to Commonwealth.” Canada may not 
unfairly claim to have been the principal architect of the British Common- 
wealth. 


At the end of nearly a century of existence as a nation Canadians
have some justification for pride in their achievements. The accomplishments
of the past and the scope and magnitude of post-war developments
have aroused high hopes for the future. Canadians believe that
Canada has a rendezvous with destiny.

From 1946 to 1953 Canada has had an outstanding period of development.
Our net gain in population during the period was more than
2½ million. New investments in fixed capital were over 30 billion
dollars. From 1946 to 1953 twenty-one per cent of our Gross National 
Product went into capital investment. 

Though capital expansion commenced immediately after World War 
II for reequipment, reconversion and modernization, needs for which 
had accumulated during the war economy and to some extent in the 
thirties, it soon went beyond the stage of catching up. New develop- 
ments were bursting out in all directions. There were great projects 
for the utilization of resources including power, minerals, and wood 
products; the discovery of major oil fields in the West; the mechaniza- 
tion of agriculture which amounted to a revolution in farming methods 
and expansion in various lines of manufacturing. 

Canada’s development as an urban and industrial economy is evi- 
dent in the fact that in 1900, 40 per cent of her labour force was en- 
gaged in agriculture. In 1946 the proportion was 25%, in 1953 it was 
16%. 

Along with this high rate of economic activity and expanded pro- 
ductive capacity, there has gone a considerable rise in the standard of 
living. The production of goods and services—the gross national 
product—has increased after correction for price changes by one-third 
from 1946 to 1953. In the United States the increase was 29%. Ca- 
nadians enjoy a material standard of living second only to that of the 
United States. 
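
To compare the two increases on a yearly basis, each cumulative figure can be converted to an average compound rate. The following is a minimal sketch, assuming seven annual compounding steps between 1946 and 1953; the function name is hypothetical and the result is an illustration, not an official statistic.

```python
# Convert a cumulative real increase over 1946-1953 into an average
# compound annual rate.
# Assumption: seven annual compounding steps between 1946 and 1953.
YEARS = 1953 - 1946


def annual_rate(total_growth: float, years: int) -> float:
    """Average compound annual growth rate implied by a cumulative increase."""
    return (1.0 + total_growth) ** (1.0 / years) - 1.0


canada_rate = annual_rate(1 / 3, YEARS)  # "one-third" increase
us_rate = annual_rate(0.29, YEARS)       # 29% increase

print(f"Canada: {canada_rate:.1%} per year")
print(f"United States: {us_rate:.1%} per year")
```

On this reckoning the Canadian increase corresponds to roughly 4.2% a year and the American to roughly 3.7% a year.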

An outstanding aspect of our recent progress is the development of 
natural resources. It is almost certain that we shall eventually be self 
supporting in oil on balance. Pipe lines have been constructed which 
take the oil to the Pacific Coast and to Eastern Canada markets. At 
the same time tremendous supplies of natural gas have been developed. 
A 2200-mile all Canadian pipe line is to be constructed from Alberta 
to deliver natural gas to Ontario and Quebec. This, it is claimed, will 
be the greatest installation of its kind in the world. Production of 
natural gas has risen from 48 billion cubic feet in 1946 to 101 billion 
in 1953. Disposable reserves are estimated to be over 15 trillion cubic 
feet. 

These developments are altering profoundly the prairie economies. 
The three prairie provinces, situated in the west half of Canada, have 
been predominantly agricultural though not entirely so. Gas and oil
are bringing diversification, not only as primary materials; in Alberta
they have also given rise to an important development in petro-chemical
industries.

What is happening in our prairie provinces is only one example of 
the important developments which have been taking place across 
Canada and in our far northern areas. They have been the subject of 
many articles in American periodicals some of which I am sure you 
have seen. 

The facts of Canada’s growth and development are so many and 
varied that there can be no doubt about them. But what of the future? 
Far be it from me to don the robe of the prophet. However, it might be 
useful at this point to summarize some of the facts on which the future 
will be built and on which we base our hopes. 

First, there is the fact of our natural resources. 

Our forest, farm, fishing and trapping industries with research and
planning can be organized on a perpetual yield basis. Sustained yield 
logging and improved silviculture techniques in our forests, improved 
methods of cultivation and new strains in cereals, restocking of fishing 
waters, conservation of wild life and general emphasis on conservation 
could ensure the use of these resources in perpetuity. 

As to foodstuffs, we shall continue to be an exporter of high quality 
wheat and other grains, and of the products of our fisheries but as 
population increases there will be less of other foods to export. 

Stored in the Canadian Shield are the raw materials of many metals 
which are an indispensable foundation for our machine age. The de- 
velopments in the Precambrian rock of this shield with its many intru- 
sions already have been spectacular, and much of this area is practically 
unexplored. It would be amazing if the future does not bring many 
more discoveries just as spectacular. 

The Paley Report says “The United States has crossed the great 
industrial divide and from being a nation with a surplus of raw ma- 
terials has become a deficit nation.” If what this report says is true, the 
United States will be increasingly interested as time passes in the riches 
of the Canadian Shield and in other mineralized areas of Canada. 

There is an important corollary to the discovery of new mineral 
areas. The opening of the iron mines of Labrador-Quebec included the 
building of 360 miles of railway. The new mining town at Lynn Lake 
in Northern Manitoba involved the building of a railroad of 155 miles. 
All such developments mean a further opening up of the northern 
areas. Great reserves of lead and zinc have been discovered at Pine 
Point on Great Slave Lake, which is 800 miles from the International 
Border. Eventually this territory also is likely to be reached by railroad.
So the east-west development of Canada is now being supplemented
by one in depth.

These resources of the forest and mine are not a sufficient basis for 
confidence in the future in themselves. We would be condemned to be 
hewers of wood and drawers of water for more fortunately situated 
countries if we did not possess another indispensable resource for a 
country with ambition to become highly industrialized; that is, abun- 
dant supplies of energy. Low cost hydro-electric power has been fun- 
damental in the development of our industries. For example, in our 
pulp and paper industry a combination of forest resources, rushing 
streams to carry down the logs and furnish hydro-electric power, has 
given us a leading position. 

The only reason we can have an aluminum industry is the existence 
of abundant low cost hydro-electric power. Kitimat is an excellent ex- 
ample. Our metallurgical and electro-chemical industries are dependent 
on this same source of energy. 

Our supply of power has increased steadily. The present turbine 
installation of 15 million horse power represents less than a quarter of 
our economic potential of nearly 66 million. This does not include the 
potential at the Grand Falls on the Hamilton River in the Quebec- 
Labrador iron ore area. New developments are tending to make it 
practicable to transmit power economically over much larger distances, 
and the harnessing of this great falls is under discussion. The potential 
is estimated as of the order of 10 million horse power. It is almost impossible
to keep up to date in these developments. Only recently it has
been announced that preliminary talks were under way to promote a 
Canadian plan for a giant hydro-electric development which would 
harness the Yukon river and provide for a new industrial area in the 
far northwest. Full utilization of the water flow available in the Yukon 
and northern British Columbia could, eventually, produce 4 to 5 mil- 
lion horse power. 

But hydro-electric power may not be enough for our future needs 
even when augmented by that from the St. Lawrence Waterways and 
other developments. It could be supplemented by the coal of the 
Maritimes, the Prairies, and British Columbia, by the oil and gas 
carried by pipe lines from the Prairies, and eventually by the use of 
atomic energy from the uranium of which Canada is a leading pro- 
ducer. 

The fact is that Canada is in the process of being bound together by 
a veritable network of equipment for the transmission of power sources 
—hydro-electric transmission lines, oil and gas pipe lines. 

More than natural resources and power is necessary for the successful 
development of the Canada of the future. We need more population;
but even there we are making progress. Our population has been in- 
creasing in recent years at a rate of about 23% per annum and is now 
over 15,000,000. It is a very pessimistic person in Canada who does not 
visualize a population of between two and three times our present one
at the end of the century. Taking current birth and death rates and 
current net immigration and assuming they continue at the same rates, 
we would have a population of 18 or 19 million in 1961, 223 million in 
1971. 

Of course, also, the growth and development of Canada will continue 
to require a vast amount of capital investment. Oil and gas can be dis- 
covered and brought to the surface, mines developed, industries built 
up, pipe lines and railroads constructed, towns erected, only with the 
aid of capital. These investments in their development and subsequent 
operation provide the work opportunities for a growing population. 

Have we good reasons for thinking the capital will be forthcoming to 
build the future Canada? Let us look at the record. While Canadians 
have financed by far the largest part of post-war investment, non- 
resident capital has been an important source of financing some major 
developments in recent years. At present other nations have $11.2 
billion worth of capital invested in Canada. Of this amount $8.6 billion 
is owned in the United States compared with $4.1 billion in 1939. 
From 1945 to 1953 the United States increased its investments in 
Canada by $3.6 billion. 

Americans have $3.6 billion of portfolio investments in Canada which 
pay interest or dividends but involve no control of the enterprise. Direct 
investments amount to $5 billion which are wholly owned or in which 
there is majority control in the United States. 
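
The investment figures just quoted can be cross-checked against one another. The sketch below uses only the amounts given in the address, in billions of dollars; the variable names are hypothetical and the derived shares are simple ratios, not published statistics.

```python
# Figures in billions of dollars, as quoted in the address.
TOTAL_NON_RESIDENT = 11.2  # all non-resident capital invested in Canada
US_TOTAL = 8.6             # owned in the United States
US_PORTFOLIO = 3.6         # portfolio investments (no control)
US_DIRECT = 5.0            # direct investments (wholly owned or majority control)
US_TOTAL_1939 = 4.1        # United States holdings in 1939

# Portfolio and direct holdings should add up to the United States total.
us_sum = US_PORTFOLIO + US_DIRECT
components_agree = abs(us_sum - US_TOTAL) < 0.05

# Derived ratios for illustration.
us_share = US_TOTAL / TOTAL_NON_RESIDENT
direct_share = US_DIRECT / US_TOTAL
growth_since_1939 = US_TOTAL / US_TOTAL_1939

print(f"Components agree with total: {components_agree}")
print(f"United States share of non-resident capital: {us_share:.0%}")
print(f"Direct share of United States holdings: {direct_share:.0%}")
print(f"United States holdings relative to 1939: {growth_since_1939:.1f}x")
```

The two components do sum to the stated total, and the ratios suggest that rather more than three-quarters of non-resident capital is American and that American holdings have about doubled since 1939.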

These direct investments in branch plants or subsidiary or controlled 
organizations are another instance of the benefits which our neighbor 
has bestowed on Canada. They have accelerated the pace of develop- 
ment of our natural resources and hastened our industrialization. They 
are advantageous from our point of view because when in operation 
in many cases they supply the means of repayment. A mine or pulp 
and paper mill produces commodities which can be exported to the 
United States. Also, successful direct investments usually mean a 
ploughing back of profits in the Canadian branch or subsidiary. It is 
a permanent type of investment and multiplies. Such operations reduce 
the net demand for U.S. dollars on the part of the Canadian economy 
and, therefore, relieve strains on our balance of payments. One of the 
greatest benefits, of course, is that along with the creation of the 
branch plant or subsidiary come the technical skills and knowledge 
accumulated through years of experience by the parent concern. 





10 AMERICAN STATISTICAL ASSOCIATION JOURNAL, MARCH 1955 


Canada, therefore, has welcomed these large American capital in- 
vestments, and their welcome has been all the heartier because there 
has been a growing tendency to train and bring Canadians into the 
management and also to permit Canadians to participate in the fi- 
nancing. On the whole these branches have adapted themselves to the 
Canadian atmosphere and have respected Canadian aims and co- 
operated in their attainment. Except in name, in actual operation 
they are scarcely distinguishable from Canadian concerns. 

Of course, I would not have you think that this large American finan- 
cial interest in Canada means that your countrymen own the bulk of 
it. Even in the four industries in which American capital is mostly 
concentrated—manufacturing, mining and smelting, petroleum ex- 
ploration and development, and public utilities, Canadian ownership 
increased from 62% in 1939 to 68% in 1951, and the proportion owned 
by Canadians of all the national wealth is very much greater than this. 

All things considered, lack of capital should not be an obstacle to the 
growth of Canada. In fact, Canada is not now heavily dependent on 
outside sources for capital. Approximately 80% of the post-war ex- 
pansion was financed by Canadians. 

Our ability to keep our trade, not only commodity but also invisible 
items such as interest, dividends, tourist trade, and a host of others in 
balance, is another factor on which future development will depend. 
This problem, of course, is known technically as the Balance of Inter- 
national Payments. 

During the year 1953 the debits in our current account with the 
United States exceeded our credits by $924 million. We had a credit 
balance with all other countries of $485 million which gave us an over- 
all debit of $439 million. Failing any other source of meeting this ad- 
verse balance, it would have been necessary to draw on our national 
reserve of United States dollars. But the change in this reserve was 
not significant. The explanation of that phenomenon is the flow of 
United States and other funds into Canada for investment, which off- 
set the adverse balance on current account and created such a demand 
for Canadian dollars that the latter has been at a premium over the 
United States dollar. 
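
The 1953 arithmetic can be laid out explicitly. This is a minimal sketch using the figures above, in millions of dollars; it relies on a simplified accounting identity (current balance plus net capital inflow equals the change in reserves) and assumes, for illustration, that the reserve change was exactly zero. The function names are hypothetical.

```python
# Current-account balances for 1953 in millions of dollars
# (credits minus debits), as quoted in the address.
balances = {
    "United States": -924,
    "All other countries": 485,
}


def overall_current_balance(by_region: dict) -> int:
    """Sum the regional current-account balances."""
    return sum(by_region.values())


def implied_capital_inflow(current_balance: int, reserve_change: int = 0) -> int:
    """Net capital inflow implied by the simplified identity
    current balance + capital inflow = change in reserves.
    Assumption: reserve_change defaults to zero ("not significant")."""
    return reserve_change - current_balance


overall = overall_current_balance(balances)
inflow = implied_capital_inflow(overall)

print(f"Overall current balance: {overall} million")
print(f"Implied net capital inflow: {inflow} million")
```

With reserves taken as unchanged, the overall debit of $439 million implies a net capital inflow of about the same size.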

Suppose, however, there was a slowing up of this capital investment, 
what would be the situation if a large deficit remained in our current 
account? Is it to be expected that year after year this huge investment 
of capital by United States corporations and individuals will continue 
without at least periods of slowing up? This brings us face to face with 
the question of trade relationships. 

Exports are the life blood of the Canadian economy. The products of 
our farms, mines, forests, and fisheries are far beyond the capacity of 
our small population to consume. The importance of foreign trade is
evident in the fact that in 1953 we ranked third in both exports and 
imports among the trading nations of the world, surpassed only by the 
U.S.A. and U.K. 

In view of the importance of foreign trade to Canada, our govern- 
ment, throughout the post-war period, has worked for the reduction of 
existing barriers to international trade. Canada is one of the few coun- 
tries in the world which has almost no significant barriers to imports 
other than tariffs, and the Canadian tariff has been reduced consider- 
ably since the war. Incidentally, Canada has never received directly 
one dollar of Lend Lease or Marshall Plan aid. Instead, she has con- 
tributed substantially in mutual aid, loans and other forms of interna- 
tional assistance like the Colombo Plan. 

Our needs have set Canada in the very forefront of international 
efforts for freer trade. This policy has been pursued in the face of dis- 
couraging obstacles. Many of the obstructions to overseas trade were 
created by the imposition of quotas and embargos so that the mutual 
tariff concessions negotiated in the General Agreement on Tariffs and 
Trade have yielded little benefit to Canada. Canada has lived up to the 
agreements but overseas countries could not, because for them the difficulty
was not the tariff but the fact that they were short of dollars.

The United States adoption of the Reciprocal Trade Agreement 
Program and some other supporting developments, such as the original 
sponsoring of the General Agreement on Tariffs and Trade, were land- 
marks in this field. Recently, however, progress has been retarded, and 
in certain cases there has even been a movement in reverse. 

There has been disappointment that simplification of your customs 
procedures, particularly in respect to the system of classification and 
valuation, has not progressed much farther than is the case. Resump- 
tion of the discussions next spring of these trade barriers, in many 
cases more protective than tariffs, will be followed here and abroad 
with keen interest. 

In spite of barriers of which we wish there were fewer, there is an 
enormous flow of goods across the border. The United States is our 
best customer. In 1953, 59% of our exports were to the U.S.A. The 
other side of the picture is that we are a very good customer since we 
purchased 73.5% of our imports from your country in 1953. In mone- 
tary terms our exports to the United States amounted to approximately 
$2½ billion and our imports from the United States to approximately
$3 billion. Since the war we have imported annually, on the average, 
over $500 million worth of goods more than we have sold to your 
country. Incidentally, 56% of our exports to the United States in 1953 
were raw or partly manufactured materials. The other 44%, i.e., fully 
processed materials and finished goods, includes the important item,
newsprint, in which the value added by manufacture is low. On the 
other hand, 84% of our imports from the United States were fully 
processed materials and finished goods. 

How does this trade situation affect our hopes for the future? There 
are a number of factors which will have a bearing on it. 

First, the question of trade is, of course, part of the wider subject— 
the balance of international payments. When one nation invests heavily
in another, a large part of the investment usually takes the form of 
a flow of commodities. If the investment diminishes, the commodity 
flow lessens. Thus the adverse balance of trade on the receiving end 
would be reduced. 

Second, hopes are rising that those countries which have been short 
of dollars are nearing the stage where convertibility of currencies will 
be achieved—a necessary basis for freer world trade. 

Third, a great deal of the development which has been going on in 
Canada has yet to reach the production stage. Its fruition should affect 
our trade picture markedly. It is difficult to be pessimistic about the 
future of Canadian trade when one thinks of the aluminum which is 
to come out of Kitimat, the iron ore from Ungava, the higher produc- 
tion of nickel and other non-ferrous metals from numerous new mines, 
titanium from the Allard Lake region when the process is working 
commercially, and the possibility of another new metallurgical colossus 
in the northwest through the harnessing of the mighty Yukon. Ca- 
nadians look to the development of the St. Lawrence Waterways as 
another means of strengthening their economic position. 

There will be, of course, increased offsets to earnings from exports by 
transfers of profits on U.S. capital investments, but, at the same time, 
a number of developments will diminish our need for imports. We shall
arrive at self-sufficiency in oil, and gas flowing through the Trans-
Canada gas pipeline to the east will replace some imported coal.

The growth of our population will mean a reduction in the amount 
of foodstuffs we shall have for export. On the other hand, Canadian 
growth and development and our rapidly expanding economy will pro- 
duce opportunities for the domestic production of manufactured or 
semi-manufactured goods which have been imported in the past. This 
is true already in our domestic iron and steel industry and in a variety 
of chemical products. 

(It is interesting to note that in the period 1926-1929 exports of 
goods and services from Canada accounted for some 29% of gross 
national expenditure, in 1936-39 the proportion was about 28%, 
and in 1950-53 about 23%. The growth in Canada’s population and 
the domestic market seems slowly to be reducing the proportion of 
Canada’s resources devoted to direct production for the foreign market.)

Nevertheless, trade could be our Achilles' heel. Canada is dependent
on foreign trade to a much greater extent than the United States with 
its huge domestic market. Canada will thrive more than most nations 
in an era of peace and growing prosperity. In a depressed economic 
climate or dollar shortage she will have her problems. With sanity and 
vision on the part of labour and management in relation to costs, and
given a world in which the forces of peace gain the upper hand and in 
which progress towards the ideal of raising world standards of living 
can be made, the economic future of Canada surely should be very 
bright. 

Some of the lesser developed nations surpass Canada in resources of 
some minerals, or in forest areas, or in power potentials. Canada has 
no rival among them in the combination of mineral and forest resources, 
energy potentials, a belief in free enterprise, the stability of sound 
government, and a safe harbor for capital investment. Canada un- 
doubtedly has an impressive array of assets and I am sure the United 
States would like to see our country continue its progress so that this 
whole North American continent may present the strongest front to 
authoritarian aggressiveness. 

Man shall not live by bread alone. Canada might have great ma- 
terial prosperity and yet be in danger of losing her own soul. At least, 
this is the warning which many eminent Canadians have been voicing 
in recent years. Our concern about cultural influences led to the ap- 
pointment in 1949 of a Royal Commission on National Development 
in the Arts, Letters and Sciences which produced a masterly report. 
It has become known as the Massey Report after its Chairman, who 
is now the first Canadian to be Governor General of the Dominion. 

It emphasizes the fact that Canada owes a great debt to the United 
States for the generous manner in which your cultural facilities have 
been made available to Canadians. The gifts of the Carnegie Corpora- 
tion and the Rockefeller Foundation; the aid of the Guggenheim 
Foundation and the American Association for the Advancement of Sci- 
ence to a host of students seeking opportunities for further study; the 
fellowships and scholarships awarded to Canadian students in American 
Universities, tell a tale of magnificent generosity. 

Moreover, our nearness to the United States and many similarities 
in our ways of life mean that a vast mass of other cultural media 
pour across the border. These include radio, films, newspapers, peri- 
odicals, books, and your touring musicians and artists. Although in 
recent years Canadian periodicals have strengthened their position 
greatly, they have had to face almost overwhelming competition. 
Canadians read more American periodicals than they do Canadian, 
with the exception of local newspapers.

All this influence, however, has not been an unmixed blessing for the 
development of Canadian culture. Educational opportunities made 
available to Canadians have resulted in a loss of much talent since many
students, when they finish their courses in American institutions, ac- 
cept positions in the United States and do not return to Canada. 

Perhaps we have been too dependent on American generosity. The 
brief of the National Conference of Canadian Universities to the Royal 
Commission commented “American generosity has blinded our eyes to 
our necessities. Culturally we have feasted on the bounty of our neigh- 
bors, and then we ask plaintively what is wrong with our progress in 
the Arts.” 

Here I might mention that since the Report of the Royal Commis- 
sion was made public the Dominion Government has instituted a sys- 
tem of special grants to Universities. 

In Canada, as in the United States, the battle between the funda- 
mentalist and progressivist philosophies of education has been joined. 
The need for more education in the humanities has been given increas- 
ing emphasis. Quite recently Hilda Neatby, a professor in one of our 
Canadian Universities, wrote a book severely critical of the present 
situation. Entitled “So Little for the Mind” it became a best-seller in 
the non-fiction field and has aroused a storm of discussion which may 
be the forerunner of reform. 

Canadians, of course, can tune in on American radio programs and 
have as their movie film diet mainly the Hollywood output. Many of 
your radio programs are superlative in character and afford immense 
enjoyment to numerous Canadian listeners. There are other aspects of 
American broadcasting which have been severely criticized in your own 
country. The Massey Report urged “That in Canada measures should 
be taken to avoid in our Radio and T.V. at least those aspects of Ameri- 
can broadcasting which have provoked in the United States the most 
outspoken and the sharpest opposition.” 

Our Canadian Broadcasting system is the result of recommendations 
made by a Royal Commission on Radio, known as the Aird Commission,
appointed in 1928. The view has been accepted that radio is a Public
Trust and not just another industry; its social influence is so great that 
for the welfare of its citizens government must take some responsibility 
in the matter of control. In the Canadian Broadcasting Act of 1936 the 
Canadian Broadcasting Corporation (C.B.C.) was created and given 
wide powers of control. Broadcasting was to become a force fostering 
a national spirit and interpreting national citizenship. The C.B.C. 
policy is to concentrate Canadian resources on the development of pro- 
grams essentially Canadian and to supplement them with the best 
available entertainment from other countries, which means mainly the
United States. 

The C.B.C. exercises control by virtue of the fact that it recom- 
mends to the Minister of Transport the grant, renewal or cancellation 
of licenses to private operators. Also, it can regulate the nature and 
amounts of advertising, political broadcasts, and generally, the charac- 
ter of all programs whether public or private. 

While, of course, there is room for improvement in the operations 
of C.B.C., the Massey Report states that it “has opened the way to 
mutual knowledge and understanding which would have seemed im- 
possible a few years before. Canadians as a people have listened to 
news of their own country and of the world, have heard public topics 
discussed by national authorities, have listened to and participated in 
discussions of Canadian problems, and have, through radio, been pres- 
ent at great national events.” In short, C.B.C. has been of immense 
importance in developing our cultural identity. 

Such is the system concerning radio in Canada. Since television also 
comes under the jurisdiction of the C.B.C., its development may be 
expected to follow the same pattern. 

Time will not permit me to enlarge on this subject. While it is true 
that mediocre movies, the juke box, soap operas, comic papers, jazz 
and so on occupy too prominent a place in our cultural atmosphere, it 
is also true that there is a widespread and growing interest in music, 
drama, painting and the fine arts generally. In the last two summers 
there was a record-breaking movement of visitors from all directions 
to witness the Shakespearian Drama Festival at Stratford, Ontario. 
Canadian artists have won acclaim abroad and Canadian musical com- 
posers were thrilled recently by the success of the all-Canadian concert 
at Carnegie Hall under the leadership of Leopold Stokowski. As time 
passes Canadians will make distinctive contributions to the culture of 
North America. 

In conclusion, we Canadians do not wish to become a pale and insipid 
copy of the great American people. We wish to have a distinctive 
Canadian way of life which, while benefiting by your great qualities, 
will evolve also from a fusion of our own rich French, Anglo-Saxon 
and other heritages, developed and refined in the crucible of Canadian 
experience. We must think for ourselves and stand on our own feet. 
Sometimes, therefore, we may have the temerity to disagree with you! 
With the fundamentals of your way of life there will be complete ac- 
cord. The northern neighbor will be one with you in promoting and 
defending the freedoms which have been inherited and enhanced, and 
in all efforts to secure peace, goodwill, and higher standards of living 
for the human race. 











THE POPULATION OF THE UNITED STATES IN 1950 
It is clear to all who consider the question that data from the United
States decennial censuses of population are less than perfect. Persons
who should be included are omitted, others are counted twice, and
characteristics of the persons included are sometimes misreported. This 
article is addressed to errors of omission and mistaken inclusion in the 
1950 census, and to the erroneous classification of persons according to 
their age, sex, and color. No mention will be made of census data relat- 
ing to kinship, employment status, occupation, education, income, etc. 

We will survey the evidence that reveals imperfections in the census, 
analyze some of the evidence, and offer a set of numbers which we be- 
lieve come closer than census figures to the United States population in 
1950, classified by age, sex, and color. 


EVIDENCE OF IMPERFECTIONS IN CENSUS DATA 


Anyone familiar with the scale of effort required to complete a census 
of population and with the necessarily limited resources made available 
for the job should not be surprised that the enumeration is less than 
completely exact. The fact that some 140,000 inexperienced enumera- 
tors and field supervisors serving with moderate pay and no prospect of 
prolonged employment conducted the actual house-to-house canvass 
leads naturally to the expectation that not every person in every dis- 
trict was properly counted. No amount of care and planning by the 
Census Bureau could insure perfection on the part of all the members of 
a large army of temporary employees. Similarly, the best designed 
questionnaire and enumerator’s instructions could not guarantee that 
the respondent would give and the enumerator record accurate in- 
formation about all the members of a given household. 





* Many of the data on which this article is based were supplied by the Bureau of the Census 
from unpublished material. The author wishes to express his appreciation to the Census Bureau for 
making this material available. In addition to giving the author access to unpublished data, members 
of the Bureau's staff were very generous in giving advice and comments at various stages in the prepara- 
tion of this work. Among those who were especially helpful are Henry S. Shryock, Jr., Henry D. Sheldon, 
and Richard A. Hornseth of the Population and Housing Division; and Joseph F. Daly and Leon Pritzker 
of the Office of the Assistant Director for Statistical Standards. At the same time that their help is 
acknowledged, it must be made clear that neither the Bureau of the Census nor any of the members of 
its staff in any way accepts, approves, or endorses the estimates herewith presented. 
Internal Evidence


Though the nature of the census leads us to expect imperfect data, 
an examination—even an exhaustive one—of census procedures would 
not give any quantitative indication of errors. To obtain clues as to 
how many persons are missed and how many are misclassified requires 
a comparison of census numbers with other numbers. Actually, a 
limited amount of information about errors can even be obtained from 
a comparison of census numbers with other numbers from the same 
census. 

Two informative kinds of “internal” comparisons are sex ratios and 
age ratios [7]. Sex ratios are formed by dividing the number of males at 
any age by the number of females; age ratios are formed by dividing 
the number of persons in any five-year age group by the average of the 
numbers in the two adjacent age groups. These two kinds of ratios will 
reveal census errors because in each instance we can estimate roughly 
what the ratios ought to be. 
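
The two definitions just given can be written out as a short sketch. The function names and the sample counts below are illustrative assumptions and are not census figures; the end groups have only one neighbor, so no age ratio is formed for them.

```python
def sex_ratios(males, females):
    """Males per female for each age group (both lists in the same age order)."""
    return [m / f for m, f in zip(males, females)]


def age_ratios(counts):
    """Each five-year group divided by the average of its two adjacent groups.

    The first and last groups have only one neighbor, so they are left as None.
    """
    ratios = [None] * len(counts)
    for i in range(1, len(counts) - 1):
        ratios[i] = counts[i] / ((counts[i - 1] + counts[i + 1]) / 2)
    return ratios


# Hypothetical counts (thousands) for five consecutive five-year age groups.
males = [500, 480, 450, 455, 430]
females = [480, 470, 460, 450, 440]

print(sex_ratios(males, females))
print(age_ratios([m + f for m, f in zip(males, females)]))
```

When the numbers in adjacent groups fall on a straight line, the age ratio is exactly 1, which is the crude standard discussed later in this section.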

As for sex ratios, one would generally expect the ratio of males to 
females in a closed population to exceed unity among young children, 
and to decline continuously as age advances. This expectation is based
on the almost universal tendency for male live births to exceed female
by some 3 to 6 per cent, and for female mortality to be less than male at
every age. The native white and native nonwhite populations are pre- 
sumably a near approximation to “closed” populations, particularly 
when the “population abroad” is included. Figure 1 shows sex ratios by 
age in 1950 for native whites and nonwhites, along with the ratios one 
would expect on the basis of the sex ratio at birth and the survival rates 
(taken from life tables) to which each cohort has been subject.¹ In
neither instance do the ratios follow the values one would expect. The 
sex ratios among young adults are particularly far from expected values. 
Nonwhite sex ratios deviate from expected values by more than 10 per 





1 The “expected” sex ratios were calculated by taking appropriate values from United States life 
tables as far back as these have been prepared (to 1920), and life tables (for the white population) 
prepared for the Death Registration Area back to 1900. Thus to obtain the expected sex ratio at ages 
40-44, we multiplied the probability of surviving from 30-34 to 40-44 in 1940-1950 (taken from the 
1945 life table) times the probability of surviving from 20-24 to 30-34 in 1930-1940, times the probabil-
ity of surviving from 10-14 to 20-24 in 1920-1930, times the probability of surviving from 0-4 to 10-14
in 1910-1920 (this last estimated by averaging 1909-11 and 1919-21 life table values). This procedure 
was followed separately for males and females. The resulting male value was multiplied by 1.055 for 
whites and 1.03 for nonwhites (representing the sex ratio at birth) and divided by the female value. 
The life table mortality experience of the male cohorts aged 20-24 through 45-49 was corrected for 
estimated excess mortality due to military deaths in World War II. Though this procedure may be 
inexact (e.g., because it employs life tables for whites instead of native whites and for the Death Regis- 
tration Area instead of all states), the fact is that sex ratios are not very sensitive to changes in mortality 
experience. Hence, except for errors in the life tables themselves caused, for example, by differential 
reporting of male and female deaths, the “standard” sex ratios should be a reliable guide. 
Fig. 1. Males per female by five-year age groups. Native whites and nonwhites
(including persons living abroad) according to the 1950 census and according
to a “cohort” life table.

[Life table data from which cohort life tables were derived are from U. S. life tables for 1945, 1930-39,
1920-29, an average of tables for 1909-11 and 1919-21, and 1901-1910. The 1945 life table was published
in: U. S. National Office of Vital Statistics, Vital Statistics—Special Reports, Vol. 23 (1947).
The 1930-39 life table was published in: U. S. Bureau of the Census, U. S. Abridged Life Tables, 1930-39.
(Preliminary.) The earlier life tables are in: U. S. Bureau of the Census, U. S. Life Tables, 1930. The
war-caused excess mortality was calculated from figures given in: U. S. National Office of Vital Statistics,
Vital Statistics of the U. S., 1949. Part I.]


cent at ages 20 through 39, and to a similar degree (though in the 
opposite direction) at ages 55 through 64. 
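
The “expected” sex ratio described in footnote 1 is a chain of survival probabilities, taken separately for each sex, scaled by the sex ratio at birth. The sketch below shows only that arithmetic. The survival probabilities in it are invented placeholders and are not life-table values; the birth ratios 1.055 (whites) and 1.03 (nonwhites) are the ones quoted in the footnote; the wartime correction is omitted.

```python
from math import prod


def expected_sex_ratio(male_survival, female_survival, sex_ratio_at_birth):
    """Expected males per female for a cohort.

    male_survival and female_survival hold the probabilities of surviving
    each successive ten-year step, one entry per intercensal decade.
    """
    return sex_ratio_at_birth * prod(male_survival) / prod(female_survival)


def percent_deviation(census_ratio, expected_ratio):
    """Deviation of the census sex ratio from the expected one, in per cent."""
    return 100.0 * (census_ratio / expected_ratio - 1.0)


# Placeholder survival probabilities for four decades (not life-table values).
male = [0.97, 0.975, 0.97, 0.955]
female = [0.975, 0.98, 0.975, 0.965]

expected = expected_sex_ratio(male, female, 1.055)
print(round(expected, 3))
print(round(percent_deviation(0.93, expected), 1))
```

Because male and female survival enter as a quotient, moderate errors in the level of mortality largely cancel, which is the footnote's reason for treating the standard ratios as a reliable guide.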

A roughly similar pattern of sex ratios has characterized earlier 
censuses. In Figure 2 the sex ratios from the last three censuses for 
native whites and for nonwhites are shown. In all three censuses the 
young adult ages have too low a sex ratio, while the sex ratio for the 
older ages is clearly too high in all instances except among native
whites in 1950. There is a marked tendency for the sex ratio at older 
ages to progress toward more reasonable values from 1930 to 1950. The 
differential reduction in mortality rates (with male rates reduced more 
than female) would account for only a small part of the reduction in 
census sex ratios. 
Fig. 2. Males per female by five-year age groups according to the 1930, 1940,
and 1950 censuses. Native whites and the nonwhite population. 1950 ratios for 
whites include the population living abroad. 
The calculation of what the age ratios ought to be is a little more
complicated. Crudely speaking, age ratios ought to be unity, on the 
assumption that numbers in adjacent age groups progress smoothly, 
and do not depart markedly in a short interval from a straight line. 
We may distinguish three factors, however, that would cause age ratios 
to depart from unity. The first factor is the tendency (if any) of typical 
mortality experience to produce a strongly non-linear age distribution. 
This factor seems to be unimportant up to very advanced ages [7]. The 
second factor is any temporary variation in birth or migration rates 
which produces an unusually large cohort. The third factor is the tend- 
ency for ages to be reported so that the proportion of persons in some 
age groups is overstated and the proportion in others understated. This 
tendency might be called age-heaping, and “five-year age-heaping” 
when we are dealing with five-year age groups. 

Since we are concerned here with census errors, we are interested in 
the last of these factors—that of five-year age-heaping. The first factor 
may be assumed to have negligible importance, and the second factor— 
unusual cohort size—we will try to eliminate. We eliminate the effect of 
unusual cohort size by dividing each age ratio by an adjusted average 
of the age ratios for the same cohort in every census in which it has been 
enumerated. Thus if the cohort aged 40-44 in 1950 were unusually 
large (relative to the adjacent cohorts) and consequently had a large 
age ratio, one would normally expect the age ratio for the group 30-34 
in 1940, for the group 20-24 in 1930, and 10-14 in 1920 to be above 
unity. Since the cohort is enumerated at a different age in each census, 
there should be a tendency for “age-heaping” to cancel out of the 
average age ratio for the cohort. Thus one would expect a given age ratio 
for a particular census divided by the average ratio for the cohort to 
differ from unity primarily because of age-heaping in that census. 

The reader will have to bear with us through one more stage of com- 
plication. We said above that there should be a tendency for age-heap- 
ing to cancel out of the average for a cohort enumerated in several 
censuses. However, if ages in the first half of each age decade (i.e., ages 
10-14, 20-24, etc.) are generally preferred in age reporting, the average 
age ratio for every alternate cohort would tend to be above unity, while 
the average for those in between would be below one. An examination of 
average age ratios for cohorts indeed shows a general alternation of large 
and small values. It was therefore decided to adjust the average cohort 
ratios for the typical average age-heaping to which a cohort is subject 
at the particular ages where it was enumerated. Thus the average age 
ratio for the cohort aged 30-34 in 1950 (which was 20-24 in 1940, and 
10-14 in 1930) was divided by the average age ratio of all groups aged 
10-14, 20-24, and 30-34 in 1930, 1940, and 1950. Since this last average
covers five different cohorts, it will not be dominated by any unusually 
large or small cohort. If this average differs from unity, the difference 
arises from the average tendency of the ages 10-14, 20-24, and 30-34 
to be favored or avoided. Division by this average adjusts the cohort 
age ratio to make it more nearly reflect cohort size alone. 
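
The two-stage adjustment described in the preceding paragraphs can be sketched as follows. This is one reading of the procedure. The layout of the data (a mapping from census year to a mapping from the lower limit of each age group to its age ratio), the function name, and the sample values are all assumptions made for illustration.

```python
from statistics import mean


def corrected_age_ratio(ratios, year, age):
    """Age ratio for one census, adjusted for cohort size.

    ratios[census_year][age_start] is the unadjusted age ratio.
    """
    # Ages at which this cohort was enumerated in each available census.
    cohort_ages = {y: age - (year - y) for y in ratios}
    cohort_ages = {y: a for y, a in cohort_ages.items() if a in ratios[y]}

    # Stage 1: average age ratio of the cohort over those censuses.
    cohort_average = mean(ratios[y][a] for y, a in cohort_ages.items())

    # Stage 2: typical heaping at those same ages, averaged over all groups
    # of those ages in those censuses (so several cohorts are covered).
    ages = set(cohort_ages.values())
    typical = mean(ratios[y][a] for y in cohort_ages for a in ages if a in ratios[y])

    adjusted_cohort_average = cohort_average / typical
    return ratios[year][age] / adjusted_cohort_average


# Hypothetical input: every age ratio is 1.0 except for one large cohort,
# which shows a ratio of 1.2 at each census in which it appears.
ratios = {y: {a: 1.0 for a in range(5, 80, 5)} for y in (1930, 1940, 1950)}
ratios[1930][10] = ratios[1940][20] = ratios[1950][30] = 1.2

print(round(corrected_age_ratio(ratios, 1950, 30), 4))
```

In this artificial case the raw ratio of 1.2 is pulled most of the way back toward unity, since most of it reflects cohort size and not a preference for the ages concerned.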

In Figure 3 corrected age ratios for the 1930, 1940, and 1950 censuses 
are charted. The following features are noteworthy: 

1. There is a strong similarity of the pattern of age-heaping from 
census to census, among both sexes, and among whites and nonwhites.

2. Age-heaping is much more pronounced among nonwhites than 
among the white population. The nonwhite age ratios differ from unity 
by as much as 20 per cent, while the largest divergence for the white 
population is less than 5 per cent. 

3. There is a tendency up to age 40 to prefer the last five years of 
each age decade, and to avoid the first five years. This tendency could 
be the result of a reluctance to pass certain 10-year birthdays. On the 
other hand, the ratio at ages 50-54 is consistently high, and at ages 
55-59 consistently low. 

4. There is a tendency, especially clear-cut among the white popula- 
tion, for the ratios at any one age to progress in the same direction 
from 1930 to 1940 to 1950. Usually this progression is toward unity— 
especially where the age ratio departs radically from 1.0 in 1930. In 
other words, five-year age-heaping, while occurring at about the same 
ages, has generally been diminishing. 

5. There is one noteworthy exception to the similarity among the 
age-ratio patterns in 1930, 1940, and 1950. This exception is the de- 
parture of age ratios for ages 60-64 and 65-69 in 1940 and 1950 from 
the 1930 values, especially among the nonwhites, to a lesser degree 
among white females, and not perceptibly among white males. The 
character of the change is that the 60-64 group became smaller relative 
to neighboring age groups, while 65-69 became much larger. This 
change is from a pattern which extends well back of the 1930 census. 
The change in age-ratio pattern confirms a shift in age reporting which 
has often been noticed and commented on.² The unexpectedly large
number of persons 65-69 and the unexpectedly small number 60-64 
have been attributed to the old-age assistance legislation which came 
into effect between 1930 and 1940. The minimum age of 65 necessary to 
qualify for benefits may have caused persons for whom otherwise an 
age of 60-64 would have been reported to be reported as 65-69. The 





² U. S. Bureau of the Census. U. S. Census of Population: 1940, Vol. IV. Characteristics by Age,
p. 3; and U. S. Life Tables and Actuarial Tables, 1939-41, pp. 110-112.
Fig. 3. Population in each five-year age group divided by average of populations
in adjacent age groups for whites and nonwhites, 1930, 1940, and 1950,
adjusted for cohort size by dividing each age ratio by corrected average ratio for
cohort.
change since 1930 is greatest among nonwhite females, nearly as great
among nonwhite males, much smaller but still detectable among white 
females, and not apparent among white males. These differences sug- 
gest that a nonwhite as opposed to a white, and a female as opposed to 
a male, is most likely to apply for old-age assistance when under 65. 

The existence of heaping in the single-year age distribution of the 
population is another internal indication that misclassifications occur. 
By Myers’ blended method [4], one can calculate the fraction for whom 
an age ending in each digit is reported in a way that would yield almost 
exactly 10 per cent for each digit if age were accurately reported. Table 
1 shows the apparent preference as indicated by this method for digits 
of age in United States censuses from 1880 to 1950, and in 1950 for the 
nonwhite population as well as the total. For the whole population in 
1950, the proportion with reported ages ending in zero was about 12 per 
cent larger than it should have been; for the nonwhite population, the 
proportion was apparently about 32 per cent too large. Note that im- 
proper digit choice has steadily declined since 1880. 
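
The blending idea can be illustrated with a short sketch. It is not taken from reference [4]. It assumes the common formulation in which the tally by terminal digit is repeated for ten overlapping age ranges, beginning at ages 10, 11, ..., 19 and each covering the same number of years, and the ten tallies are then added. The function names, the default span of 70 years, and the test populations are assumptions for illustration.

```python
def myers_blended_percentages(pop_by_age, first_start=10, span=70):
    """Per cent of the blended population reported at each terminal digit 0-9.

    pop_by_age[a] is the number of persons reported at single year of age a.
    The span must be a multiple of ten so every range holds each digit equally.
    """
    if span % 10 != 0:
        raise ValueError("span must be a multiple of 10")
    blended = [0.0] * 10
    for start in range(first_start, first_start + 10):
        for age in range(start, start + span):
            blended[age % 10] += pop_by_age[age]
    total = sum(blended)
    return [100.0 * b / total for b in blended]


def average_deviation(percentages):
    """Mean absolute deviation of the digit percentages from 10 per cent."""
    return sum(abs(p - 10.0) for p in percentages) / len(percentages)


# Hypothetical population: declines linearly with age, no misreporting.
smooth = [10000 - 100 * a for a in range(100)]
# Same idea, but reports at ages ending in 0 are inflated by one tenth.
heaped = [n * 1.1 if a % 10 == 0 else n for a, n in enumerate(smooth)]

print([round(p, 2) for p in myers_blended_percentages(smooth)])
print([round(p, 2) for p in myers_blended_percentages(heaped)])
```

Starting the tally at each of ten successive ages gives every terminal digit the leading position once, so an accurately reported population that declines steadily with age yields 10 per cent at each digit, and any excess points to digit preference.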


TABLE 1
PER CENT OF POPULATION WITH EACH TERMINAL
DIGIT 0-9 USED TO REPORT AGE IN U. S.
CENSUSES, 1880-1950

               Total population (whites and nonwhites)            Whites   Nonwhites
Digit     1880   1890   1900   1910   1920   1930   1940   1950     1950        1950
0         16.8   15.1   13.2   13.2   12.4   12.3   11.6   11.2     10.9        13.2
1          6.7    7.4    8.2    7.7    8.0    8.0    8.5    8.9      9.0         7.3
2          9.4    9.7    9.8   10.2   10.2   10.3   10.4   10.2     10.2         9.9
3          8.6    9.1    9.3    9.2    9.4    9.4    9.6    9.7      9.8         8.8
4          8.8    9.0    9.5    9.4    9.4    9.6    9.7    9.7      9.8         9.2
5         13.4   12.3   11.3   11.5   11.3   11.2   10.7   10.6     10.5        11.5
6          9.4    9.6    9.4    9.6    9.7    9.6    9.6    9.8      9.8         9.4
7          8.5    8.9    9.3    9.1    9.4    9.3    9.6    9.7      9.8         9.4
8         10.2   10.4   10.2   10.7   10.6   10.5   10.3   10.2     10.1        10.7
9          8.2    8.5    9.7    9.4    9.6    9.8   10.0   10.1     10.1        10.6

Average per cent deviation from 10%:
          2.08   1.56    .94   1.12    .90    .86    .60    .45      .36        1.20

(Calculated so that unbiased use of terminal digits would yield 10% for each digit)
Sources: 


From 1880 to 1930: Myers, Robert J., “Errors and bias in the reporting of ages in census data.” 
Transactions of the Actuarial Society of America, Vol. 41 (Part 2), No. 104, 1940, pp. 395-415. 

1940: U. S. Bureau of the Census. Sixteenth Census of the United States, 1940. U. S. Life Tables,
1939-41, p. 121.

1950: Computed from U. S. Bureau of the Census. U. S. Census of Population: 1950. Vol. II,
Table 94.
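
The summary line of Table 1 and the excesses at digit zero cited in the text can be checked directly from the tabulated percentages. The sketch below uses three columns of the table as printed; the function and variable names are illustrative.

```python
# Percentages for terminal digits 0-9, copied from three columns of Table 1.
table_1 = {
    "total 1880": [16.8, 6.7, 9.4, 8.6, 8.8, 13.4, 9.4, 8.5, 10.2, 8.2],
    "total 1950": [11.2, 8.9, 10.2, 9.7, 9.7, 10.6, 9.8, 9.7, 10.2, 10.1],
    "nonwhite 1950": [13.2, 7.3, 9.9, 8.8, 9.2, 11.5, 9.4, 9.4, 10.7, 10.6],
}


def average_deviation(percentages):
    """Average absolute deviation from 10 per cent over the ten digits."""
    return sum(abs(p - 10.0) for p in percentages) / len(percentages)


def excess_at_zero(percentages):
    """Per cent by which the share reported at digit 0 exceeds 10 per cent."""
    return 100.0 * (percentages[0] / 10.0 - 1.0)


for label, column in table_1.items():
    print(label, round(average_deviation(column), 2), round(excess_at_zero(column)))
```

The computed averages agree with the printed bottom row of the table (2.08, .45, and 1.20), and the excesses at zero in 1950 come to about 12 per cent for the whole population and about 32 per cent for nonwhites, as stated in the text.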
External Evidence


Any close estimate of the numbers of persons omitted from the 1950 
census and the numbers with specified misreporting of characteristics 
requires a comparison with information derived from some outside 
source. Among the other sources which have been examined in attempt- 
ing to appraise the 1950 census (or earlier censuses) are [1]: 

1. Estimates of the number of young persons derived from the births 
registered in the years before the census, and from the deaths among 
young children recorded between the date of birth and the census 
period. 

2. Estimates of persons over 10 years of age derived from earlier 
censuses, and from death and international migration statistics for the 
intercensal interval. As will be seen later it is primarily by an extension 
of this method that the age-sex-color distribution for 1950 offered be- 
low is obtained. 

3. Draft registration of males of military age. The selective service 
registration of 1940, which was a mandatory registration for all draft- 
age males in the United States (with certain exceptions, such as those 
already in the armed services) in October, 1940, provides numbers 
which, after adjustment for deaths, migration, and differences in cover- 
age, can be compared with the 1940 census. 

4. The Post Enumeration Survey conducted in 1950. This survey 
provided more carefully determined numbers for a representative sam- 
ple of areas in the United States. From the discrepancies observed in 
these areas, it is possible to estimate what the nationwide discrepancies
would have been had the Post Enumeration Survey’s careful enumera- 
tive methods been used nationally. One of the important features of the 
Post Enumeration Survey is that it permits a name-by-name check on 
part of the census population rather than merely a comparison of totals. 
This feature is particularly important to an investigation of the report- 
ing of personal characteristics by the census. 

We will leave to one side for the moment all comparisons except those 
derived from birth data and other censuses. The differences, with re- 
gard to persons classified as white males, between numbers derived 
from this source and numbers from the 1950 census are shown in 
Figure 4. 

Differences between the expected and census population in 1940 and 
1950 form remarkably similar patterns. The pattern alone does not tell 
us much, however, about shortages and overcounts in the censuses, 
since an apparent undercount can be caused (for example) by an 
actual undercount, by an overcount of the same cohort in the earlier 





















POPULATION OF U.S., 1950—REVISION OF CENSUS FIGURES 25 
Fig. 4. Differences between the population expected on the basis of the preceding
census and estimated cohort changes and the enumerated population, as a
per cent of enumerated population, 1940 and 1950, white males. 


census, or by an underestimate of the intercensal decline in the size of 
the cohort. 
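
The quantity plotted in Figure 4 can be sketched as follows. The expected population of a cohort is taken here as its count in the preceding census, less intercensal deaths, plus net immigration. This is a simplified reading of the method, and the function names and figures are hypothetical.

```python
def expected_population(earlier_count, deaths, net_migration):
    """Cohort size expected at the later census from the earlier count."""
    return earlier_count - deaths + net_migration


def percent_difference(expected, enumerated):
    """Expected minus enumerated, as a per cent of the enumerated population."""
    return 100.0 * (expected - enumerated) / enumerated


# Hypothetical cohort (thousands): counted in the earlier census, then
# reduced by deaths and increased by net immigration over the decade.
expected = expected_population(earlier_count=5200, deaths=180, net_migration=30)
enumerated = 4950

print(expected)
print(round(percent_difference(expected, enumerated), 2))
```

A positive difference has the appearance of an undercount in the later census, but the same figure would arise if `earlier_count` were too high or `deaths` too low, which is why the pattern alone cannot settle the size of the errors.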

The pattern in Figure 4 can be explained in part by the age-heaping
we have already discussed. The excess of expected over enumerated
population at 45-49 is as one would expect—since, judging from 
Figure 3, 35-39 seems to be a more strongly preferred age group than 
45-49. But the very low level of the curves at 10-14 and the relatively 
high level from 15-34 would not be expected on the basis of age-heaping 
evidence. On the other hand, if enumerated white males under five are 
compared with the number expected on the basis of registered births 
(corrected for underregistration), an apparent undercount of 7 per cent
in 1940 and about 4.5 per cent in 1950 is revealed. Thus the deficiency 
of expected compared to the enumerated population at ages 10-14 may 
reasonably be attributed to a relatively large undercount of the cohort 
in an earlier census. 


THE HYPOTHESIS THAT RECENT CENSUSES ARE 
SUBJECT TO SIMILAR ERRORS 


Much of the evidence we have surveyed so far would support the 
supposition that the 1930, 1940, and 1950 censuses were characterized 
by errors which form similar patterns. By similar patterns we mean 
that the variation of undercounts and overcounts according to age, 
sex, and color is alike from one census to the next.

Some of the factors lending credence to this hypothesis are: 

1. Many features of census taking which would be most likely to 
affect omissions, erroneous inclusions, and misclassification by age, sex,
and color did not change substantially. Persons were enumerated at 
their places of regular residence in all three censuses; all three enumera- 
tions required a period of several weeks to complete; in all instances the 
enumerators were temporary employees selected by similar procedures; 
the questions were generally asked of one respondent for each house- 
hold in all three censuses; age was recorded in response to essentially 
the same question (“How old was he on his last birthday?”) and in no 
instance in response to a question about date of birth; etc. There were 
some procedural changes that may well have altered the pattern of 
errors somewhat. The introduction of special infant cards in 1940 (for 
children under four months of age) may have had a tendency to reduce 
the undercounting of the very young; the new rules in 1950 designating 
the college rather than the home as the usual residence of college stu- 
dents may have affected errors in counting college-age groups; and the 
“T-Night” procedure in 1950 may have caught transients who would 
have been missed by the techniques used in earlier censuses. Neverthe- 
less, the general similarity of census-taking methods would support the 
belief that if young adult males were “missed” in 1950, they were 
probably missed in 1940 and 1930 as well; and that if ages were mis- 
reported in a particular way in one census, they were similarly misre- 
ported in the others. 

2. The pattern of age ratios by sex and color, and sex ratios by age 
and color showed striking similarities in the three censuses. (See Figures 
2 and 3.) As to age ratios, the most notable shifts of pattern (at ages 
60-64 and 65-69) are explainable by new legislation. The sex ratios in
all censuses are too low among young adults and too high among per-
sons 40-64. There is, however, an observable tendency for extreme age
ratios to progress toward unity, and for sex ratios among older persons 
to move toward more reasonable values. 

3. Finally, when populations expected on the basis of earlier censuses 
and estimated cohort changes are compared with enumerated popula- 
tions, the strongly similar pattern of discrepancies suggests that omis- 
sions and misstatements are similar from one census to the next. 

We propose to use the hypothesis of similar errors as a means for 
forming new estimates of overcounts and undercounts in the 1950 
census. The hypothesis is the basis for new estimates when combined 
with estimated intercensal cohort changes, and with independently 
estimated errors for special age groups in one census. One determines 
the error in 1950 for certain age groups—say 5-9 and 10-14—by an 
independent method, estimates the errors in an earlier census at the 
same ages by means of the hypothesis of similarity, corrects the earlier 
census population, and finally forms new estimates of the expected 
population 15-19 and 20-24 by adding intercensal cohort changes to 
the revised figures in the earlier census. The repeated use of this pro- 
cedure will generate estimates of errors at all ages. The essence of the 
procedure is very simple—an assumption of similarity enables us to 
estimate errors in earlier censuses at the same ages for which errors are
calculated in a later census; while cohort changes combined with known
errors in an earlier census permit us to estimate errors at later ages in 
the later census. 


SYSTEMATIC COMBINATIONS OF ESTIMATED COHORT CHANGES 
AND THE HYPOTHESIS OF SIMILAR ERRORS 


We will use several variants of the hypothesis that error patterns 
were similar in the last three censuses, and will note the resulting differ- 
ences in estimated undercounts and overcounts. Later on we will try to 
take account of the previously noted differences among the censuses 
with respect to five-year age-heaping. 

Specifically, the following procedures are employed: 

1. The population in 1950 at ages 0-4, 5-9, and 10-14 by color and 
sex is estimated from registered births (corrected for underregistration), 
deaths, and migration. These estimates are deemed correct, and differ- 
ences between them and the 1950 census are interpreted as census 
undercounts. 

2. The 1930 and 1940 censuses are assumed subject to the same per 
cent undercounts at ages 5-9 and 10-14 as the 1950 census. 











28 AMERICAN STATISTICAL ASSOCIATION JOURNAL, MARCH 1955 


3. Estimates of intercensal cohort changes (made by the Bureau of 
the Census from registered deaths and from migration statistics) are 
added to the estimated numbers at age 5-9 and 10-14 in 1930 and 1940 
to obtain numbers at ages 15-19 and 20-24 in 1940, and 15-19, 20-24, 
25-29, and 30-34 in 1950. The census undercounts or overcounts at
these ages are then calculated.

4. It is assumed that (a) the 1930 census was subject to the same
per cent undercount or overcount as the average of the 1940 and 1950 
censuses, or (b) that the 1930 census was subject to an undercount the 
same as that in 1940 or 1950, whichever was smaller. 

5. These assumptions permit the calculation of a corrected popula- 
tion at ages 15-19 and 20-24 in 1930; and by adding cohort changes, 
one can estimate the numbers 25-29 and 30-34 in 1940, and 35-39 and 
40-44 in 1950. By an iterative process it is possible to continue the 
calculations to the oldest ages. 

6. Another variant using the assumption of similarity in the pattern 
of errors is to calculate undercounts at ages 0-14 in 1950 as above, and 
to assume for ages above five that the 1940 census was subject to the 
same per cent under- and overcounts as the 1950. 
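The six procedures above amount to a chain calculation, which can be sketched in a few lines of Python. What follows is a minimal illustration under stated assumptions, not the computation actually carried out for Table 2: it uses ten-year age groups for brevity, follows the two-census variant of procedure 6 (per cent errors in 1940 assumed equal to those in 1950), and every function name and number in it is hypothetical.

```python
# Hypothetical sketch of the two-census variant (procedure 6), ten-year age groups.
AGES = ["5-14", "15-24", "25-34", "35-44"]


def estimate_errors(census_1940, census_1950, cohort_change, birth_based_1950):
    """Return per cent errors in the 1950 census by age (positive = undercount).

    census_1940, census_1950: enumerated populations by age group
    cohort_change: 1940-1950 change (deaths and net migration) for the cohort
        that starts the decade in each age group; usually negative
    birth_based_1950: independent estimate for the youngest group in 1950
    """
    errors = {}
    youngest = AGES[0]
    # Procedure 1: the youngest group is checked against the birth-based estimate.
    errors[youngest] = (
        100.0 * (birth_based_1950 - census_1950[youngest]) / census_1950[youngest]
    )
    # Procedures 2-5: chain upward one age group at a time.
    for younger, older in zip(AGES, AGES[1:]):
        # Similarity assumption: 1940 had the same per cent error at this age.
        corrected_1940 = census_1940[younger] * (1 + errors[younger] / 100.0)
        # Carry the corrected cohort forward ten years.
        expected_1950 = corrected_1940 + cohort_change[younger]
        errors[older] = 100.0 * (expected_1950 - census_1950[older]) / census_1950[older]
    return errors


# Made-up figures (thousands), for illustration only.
census_1940 = {"5-14": 1000.0, "15-24": 980.0, "25-34": 900.0, "35-44": 800.0}
census_1950 = {"5-14": 1100.0, "15-24": 970.0, "25-34": 940.0, "35-44": 860.0}
cohort_change = {"5-14": -20.0, "15-24": -30.0, "25-34": -40.0}

errors = estimate_errors(census_1940, census_1950, cohort_change, birth_based_1950=1133.0)
for age in AGES:
    print(age, round(errors[age], 1))
```

The sketch makes the dependence visible: each age group's estimated error rests on the estimate for the group ten years younger, so anything wrong at the start of the chain is carried to every later age.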

Table 2 presents undercounts and overcounts by age, color, and sex 
in 1940 and 1950 according to three different assumptions of similarity 
in error patterns. In the first columns of Table 2 it is assumed for ages 
above 15 that per cent over- or undercounts in 1930 were the same as 
in 1940 or 1950, whichever was less; in the middle columns average of 
1940 and 1950 miscounts are assumed for 1930; while in the last 
columns 1940 miscounts are assumed equal to those of 1950. 


A MORE REFINED HYPOTHESIS OF SIMILAR ERROR PATTERNS 
IN THE 1930, 1940, AND 1950 CENSUSES 


The three versions of the hypothesis of similar error patterns used as 
the basis for Table 2 take no account of the differences among the last 
three censuses in five-year age-heaping which are evident in Figure 3. 
A more refined version of the hypothesis would assume that the errors
in one census were the same as in another, after explicit allowance for 
observed changes in five-year age-heaping. The reason such refinement 
is worthwhile is that it produces a more logically consistent set of age- 
by-age estimates of errors in 1950. Figure 3 suggests, for example, that 
ages 35-39 were more strongly preferred in 1930 than in 1940 or 1950. 
Indeed our rough calculations (briefly described in Appendix I) yield 
estimates that 3.8 per cent fewer nonwhite males were heaped into ages 
35-39 in 1950 than in 1930. On the basis of this estimate, one would
assume that if nonwhite males were undercounted by x% in 1950, the
undercount in 1930, being compensated by greater age-heaping,
would be only (x - 3.8)%. The effect in 1950 (because of our technique
of using cohort changes) is to reduce the estimated undercount at ages
55-59 by about 3.8 per cent. This more refined hypothesis is used in
preparing the figures we finally select as our estimates of the 1950
population. The more acceptable age ratios in the corrected population
(evidence that the age-to-age errors are more logically consistent) are
shown in Figure 6. More will be said about these age ratios later.³


TABLE 2

ESTIMATED ERRORS IN THE 1940 AND 1950 CENSUSES AS A
PER CENT OF THE CENSUS POPULATION BY SEX, COLOR,
AND 5-YEAR AGE GROUPS ACCORDING TO 3 DIFFERENT
ASSUMPTIONS ABOUT SIMILAR PATTERNS OF ERROR

         Assuming errors in 1930    Assuming errors in 1930    Assuming errors in 1940
         equal to the average of    equal to those in 1940     equal to those in 1950
         1940 and 1950 errors       or 1950, whichever is
                                    less
Age         WM    WF   NWM   NWF      WM    WF   NWM   NWF      WM    WF   NWM   NWF

1950
0-4        4.5   3.8    11    10      4.5   3.8    11    10      4.5   3.8    11    10
5-9        3.1   2.5    12    10      3.1   2.5    12    10      3.1   2.5    12    10
10-14      1.1   1.1     7     7      1.1   1.1     7     7      1.1   1.1     7     7
15-19      2.6   1.5    18    12      2.6   1.5    18    12      2.6   1.5    18    12
20-24      2.3    .9    19     6      2.3    .9    19     6      2.3    .9    19     6
25-29      4.4   1.5    27     9      4.4   1.5    27     9      2.8    .8    27    11
30-34      4.1   -.3    27     4      4.1   -.3    27     4      1.7    .3    32    10
35-39      2.8   -.6    20     7      1.9  -1.0    20     5      [?]  -1.3    25     7
40-44      1.4   2.1    21    14      -.2   1.8    18    10     -1.0   -.5    26     7
45-49      2.6   3.1    23    17      2.1   2.7    20    16      0.0    .6    27    15
50-54      [?]   1.0    14   [?]      -.9   -.5    11     1     -2.7   -.5    25     9
55-59      6.7   7.8    36    44      5.8   8.4    33    40       .8   3.6    38    32
60-64      [?]   4.9   [?]    30     -1.0   2.5    18    15     -3.0   1.9    42    19
65-69      1.7   2.2    21   [?]      -.6    .9    15     0     -6.0  -3.0    13   -10
70-74      -.5   2.3    44    15     -2.3    .3    34    -1     -9.6   [?]   [?]   [?]
75+       -7.5  -6.2    73    34    -11.1  -9.1    47    20    -24.9 -11.3   113    48
Total      2.6   1.8    19    11      2.0   1.4    18     9       .2    .6    22    10

1940
0-4        7.1   6.5    19    17      7.1   6.5    19    17      4.5   3.8    11    10
5-9        3.1   2.5    12    10      3.1   2.5    12    10      3.1   2.5    12    10
10-14      1.0   1.1     7     7      1.0   1.1     7     7      1.1   1.1     7     7
15-19      4.1   2.2    18    10      4.1   2.2    18    10      2.6   1.5    18    12
20-24      4.7    .4   [?]   [?]      4.7    .4   [?]   [?]      2.3    .9    19     6
25-29      5.5   1.6    22    10      4.6   1.2    22     9      2.8    .8    27    11
30-34      4.1   2.8    27    17      2.9   2.6   [?]    14      1.7    .3    32    10
35-39      2.6   1.1    21     9      2.1    .7    19     8      [?]  -1.3    25     7
40-44      1.9   1.0    16     6       .8   -.5    13     0     -1.0   -.5    26     7
45-49      5.2   4.4   [?]    23      4.4   5.0    23   [?]      0.0    .6    27    15
50-54      [?]   2.2   [?]   [?]      [?]   [?]     9     7     -2.7   -.5    25     9
55-59      6.9   8.2    46    48      5.1   7.1    41    43       .8   3.6    38    32
60-64      3.1   4.0    45    30      1.9   2.5    39    18     -3.0   1.9    42    19
65-69      [?]   [?]   [?]   [?]      -.8   [?]     4   -10     -6.0  -3.0    13   -10
70-74     -2.7   [?]   [?]   [?]     -5.4  -1.2   [?]    -2     -9.6   [?]   [?]   [?]
75+       -3.4  -4.5    10    -9     -5.3  -6.0    -1    -9    -10.5  -7.0    51    29
Total      3.5   2.4    19    12      2.8   1.9    17    10       .8   1.0    22    10

[?] Entry illegible in the source copy.

Sources: Changes in cohorts taken from U. S. Bureau of the Census. Current Population Reports,
Series P-25, No. 98, 1954. Estimates of the population 0-4 in 1940 and 0-14 in 1950 based on births
adjusted for underregistration, deaths, and migration from the same source. The method of
calculation is described in the text.


VALIDITY OF THE ERROR ESTIMATES BASED ON THE HYPOTHESIS 
OF SIMILAR ERROR PATTERNS 


The corrections to the 1950 census derived from the hypothesis of 
similar error patterns may be tested in several ways, none entirely con- 
clusive or satisfactory. We can examine critically the steps by which 
the estimates are constructed; we can show how the estimates would 
be affected by plausible variations in the basic figures used; we can test 
the internal consistency of the corrected figures by computing age and 
sex ratios; and we can compare these corrections with other error esti- 
mates—notably those derived from selective service figures for young 
males in 1940, and from the Post Enumeration Survey in 1950. 


Critical Appraisal of the Method Employing the Hypothesis of Similar 
Errors 


The validity of this method of estimating census errors depends, of 
course, on the truth of the assumption that census errors are similar as 
postulated, and also on the accuracy of the data we have on births and 
on intercensal cohort changes. The evidence underlying the assumption 
that errors form a similar pattern from census to census has been sum- 
marized earlier and will not be re-examined. We will, however, examine 
briefly the quality of the data on which the error estimates rest. 

The number of live births in the United States from 1935 to 1950 
plays a key role in this method of estimating census errors. Estimates of 
the “correct” population up to age 15 depend directly on these birth 
data, and a 1 per cent error in estimating the number of births would 
cause a corresponding error in estimating the corrected population 
under 15. Moreover, because of the iterative character of the method, 
a 1 per cent change in birth figures would bring about a very nearly 
equal change in the corrected population at all ages. 

3 See pp. 35-39 below.

The birth figures employed are the number of registered live births,
corrected for underregistration. It is fortunate that birth registration
has been subject to check on two occasions so that convincing estimates 
of underregistration are available. Estimates of underregistration are 
based on the results of matching individual registrations against indi- 
vidual census records in 1940 and 1950. From these matchings the de- 
gree of underregistration in each match period (four months in 1940 
and three months in 1950) was estimated separately by state of birth, 
color, and according to whether the birth occurred in a hospital or else- 
where. Corrections to registered births between 1940 and 1950 were 
based on the assumption that completeness of registration for each 
category of birth changed linearly between the two tests. For the years 
1935-1939 underregistration within each category was assumed con-
stant at the 1940 level.⁴
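The correction for underregistration described above can be sketched as follows. This is only an illustration for a single category of birth: the completeness figures are invented, the two tests are treated as if they fell exactly at 1940 and 1950, and the function names are hypothetical. The source applies the correction separately by state of birth, color, and place of occurrence.

```python
def completeness(year, c1940, c1950):
    """Fraction of births registered in one category (hypothetical values)."""
    if year <= 1940:
        return c1940  # 1935-1939: held constant at the 1940 level
    if year >= 1950:
        return c1950
    # Between the two tests completeness is assumed to change linearly.
    return c1940 + (c1950 - c1940) * (year - 1940) / 10.0


def corrected_births(registered, year, c1940, c1950):
    """Registered births divided by the estimated completeness of registration."""
    return registered / completeness(year, c1940, c1950)


# Invented example: 92 per cent complete in 1940, 98 per cent in 1950.
print(completeness(1945, 0.92, 0.98))
print(corrected_births(950.0, 1945, 0.92, 0.98))
```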

The third support on which this method rests—in addition to the 
assumption of similar error patterns and the accuracy of birth data— 
is the quality of the estimates of intercensal changes in cohort size. The 
estimates of cohort changes we have used were prepared by the Bureau 
of the Census [10]. 

Cohorts change in size because of death and international migration. 
The estimates of cohort changes from 1930-1940 and 1940-1950 were 
based on somewhat different methods. For the earlier decade the frac-
tion of each cohort surviving was calculated from the U. S. Life Table
for the decade. Net international migration was assumed to be zero for 
every cohort. (This assumption rests on the fact that the official data 
on international movements for the total population show only a few 
thousand net migrants during the intercensal period.) For the interval 
between the 1940 and 1950 censuses, cohort changes due to mortality 
were estimated from annual registered deaths. Estimated decedents 
were subtracted from each cohort; whereas from 1930 to 1940, cohort 
attrition was estimated by means of a survival ratio. Cohort changes 
due to migration in the later decade were estimated from data supplied 
to the Census Bureau by the Immigration and Naturalization Service. 
These data included information on age, color, and sex only for alien 
migrants; the age, color, and sex of citizen migrants (including migrants 
to and from U. S. possessions) had to be determined for the most part 
by guesswork. 
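The two ways of estimating cohort change just described differ only in how attrition enters the calculation. A small sketch, with invented numbers and hypothetical function names, sets them side by side:

```python
def survive_by_ratio(cohort_start, survival_ratio):
    """1930-1940 approach: apply a life-table survival ratio; migration taken as zero."""
    return cohort_start * survival_ratio


def survive_by_deaths(cohort_start, annual_deaths, net_migration=0.0):
    """1940-1950 approach: subtract estimated decedents, then add net migration."""
    return cohort_start - sum(annual_deaths) + net_migration


# Invented cohort of 1,000 persons.
print(survive_by_ratio(1000.0, 0.97))
print(survive_by_deaths(1000.0, [3.0] * 10))
print(survive_by_deaths(1000.0, [3.0] * 10, net_migration=5.0))
```

The first approach inherits any error in the population figures behind the life table; the second inherits any error in registered deaths and in the migration data.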

The only allowance made for underregistration of deaths was the 
assumption that deaths to children under one are underregistered by 
the same per cent as births. No test for the completeness of death registration
comparable to the tests of birth registration has been attempted.
However, it is generally believed among public health statisticians (on
the basis of the plausible thesis that a physician is more likely to be in
attendance or a public official to be notified in the event of a death than
in the event of a birth) that deaths are more completely registered than
births [12].

4 The matching procedure has been described in some detail [2, 5, 6], and the estimates used here
of the corrected number of births for each year have been published in an official source [11].

On the other hand, the survival rates used for the 1930-1940 period 
depend on population figures as well as on registered deaths, and this 
dependence introduces an additional possibility of error in the esti- 
mates of cohort change. Moreover, even if death registration were suffi- 
ciently complete for our purposes, erroneous reporting of the age of 
decedent, and reporting of his color which is not consistent with census 
classification, could be the source of false estimates of the change in 
specific cohorts. 

There is a widespread and credible impression that international 
migration statistics are deficient. A recent instance serving as a source 
of skepticism is the illegal and presumably unrecorded entry of large 
numbers of “wetbacks” from Mexico. There are other notable deficien- 
cies in the migration portion of the estimated intercensal cohort changes 
in addition to unrecorded entries and exits. The negligible level of total
migration between 1930 and 1940, if the over-all figures are accepted as
valid, does not really imply that there were no significant gains or
losses among particular cohorts. It is more likely that there were slight
gains among the younger adult cohorts and slight losses among the 
older ones. Also the estimation of the age, color, and sex of non-alien 
migrants between 1940 and 1950 is little more than a guess. 


The Effect of Plausible Variations in the Data on the Estimated Correc- 
tions 


The effect of variations in the assumption of a similar error pattern 
is illustrated in Table 2. The identical value of the estimated errors 
under age 25 in Table 2 for all three assumptions merely reflects the 
fact that in each instance errors at ages 5-9 and 10-14 are assumed 
equal to those in 1950, and errors under 15 in 1950 are derived from 
birth data. 

The significance of the differences in Table 2 will be more apparent if 
we first consider what would be the effect on this method of estimating 
errors if one census differed slightly from others at all ages. Suppose, 
to be specific, that the 1930 census in addition to having the character- 
istic errors of age-heaping, and of omission in particular age groups 
(such as young children) had more or less consistently omitted 1 per 
cent more persons at each age than the later censuses. The result would
be that the “expected” population in 1940 and 1950 would be too small
starting with the earliest ages dependent on the assumption of similar 
errors; and hence the undercount estimated at these ages would be 
about 1 per cent too small (or the overcount 1 per cent too large). 
These mistaken estimates of errors in 1940 and 1950 would cause a 
correction 2 per cent too small to be applied to the corresponding age 
groups in 1930. The net effect would be estimates of a purportedly cor- 
rected population in 1950 which would be 1 per cent too small at young 
adult ages, 2 per cent too small twenty years older, and 3 per cent too 
small forty years older. 

If only two censuses are involved—as is the case in the third part of 
Table 2—the cumulative effect of a consistent 1 per cent deficiency in 
the earlier census is more rapid and pronounced. Thus if the 1940 
census were 1 per cent less fully enumerated at all ages than the 1950 
census, the “correct” population at ages 15-24 would be 1 per cent 
underestimated; at 25-34, 2 per cent underestimated; at 35-44, 3 per 
cent; etc. 
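The cumulation in the two-census case can be checked with a small simulation. The sketch below is an illustration under simplifying assumptions that are not in the source: the same true cohort size, survival ratio, and 1950 completeness at every age, with cohort changes known exactly. All names and numbers are hypothetical.

```python
def corrected_to_true_ratios(extra_omission, steps):
    """Ratio of the "corrected" 1950 population to the true one at successive
    ten-year age groups, when the 1940 census is less complete than the 1950
    census by `extra_omission` (a fraction) at every age."""
    true_1940 = 1000.0  # hypothetical true cohort size in 1940
    survival = 0.95     # hypothetical ten-year survival, assumed known exactly
    c1950 = 0.97        # hypothetical completeness of the 1950 census
    c1940 = c1950 * (1 - extra_omission)
    factor = 1 / c1950  # correction factor at the youngest ages (birth-based, exact)
    ratios = []
    for _ in range(steps):
        corrected_1940 = true_1940 * c1940 * factor  # similarity assumption
        expected_1950 = corrected_1940 * survival    # carried forward ten years
        true_1950 = true_1940 * survival
        ratios.append(expected_1950 / true_1950)
        # Correction factor implied for 1950 at the older age; it is reused
        # for the same age in 1940 on the next pass.
        factor = expected_1950 / (true_1950 * c1950)
    return ratios


print([round(r, 4) for r in corrected_to_true_ratios(0.01, 3)])
```

With a 1 per cent deficiency the ratios fall to about 0.99, 0.98, and 0.97 at successive ten-year age groups, which is the 1, 2, and 3 per cent underestimate described in the text.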

The foregoing feature of our method (that a systematic difference in 
censuses produces errors which cumulate at the older ages) accounts for 
many of the differences in Table 2. Thus one would expect the assump- 
tion of average rather than minimal errors in 1930 to produce a larger 
corrected population in 1950, with the differences becoming greater 
with advancing age. Also the relatively small undercount at 25-44 and 
the rapidly rising indicated overcount at the upper ages among white 
males in the third part of the table suggests that white males were 
generally subjected to a smaller count in 1940 than in 1950. This im- 
pression is confirmed by Figure 4 which shows the population expected 
on the basis of the preceding census minus the enumerated population 
expressed as a per cent of the latter. This difference is uniformly lower 
in 1950 than in 1940, as would be the case if white males were con- 
sistently subjected to a small count in 1940 as compared to 1950 and 
1930. As a result we will discard (at least for the white population) the 
assumption that errors in the 1940 census were the same as in 1950, and 
restrict our choice to some form of assumption about the 1930 census. 
The assumption of smaller errors in 1930 is conservative, though the 
fact that it leads to estimates of substantial overcounts above age 65 
may mean that it is too conservative with regard to the white popula- 
tion. 

* * *

Defects in our basic data would affect the validity of the estimated 
errors in varying degree. We have already pointed out that a given per 
cent mistake in the estimated completeness of birth registration will
produce an error of equivalent magnitude in the corrected census
population. However, the estimates of birth underregistration seem 
well grounded. 

The effect of deficiencies in the mortality data would be much less, 
because cohort attrition due to death is less than 10 per cent up to age 
50 among whites and up to age 40 among nonwhites. Thus an under- 
registration of deaths by 10 per cent would cause less than a 1 per cent 
error in estimating the size of a cohort under those ages. Table 3 shows 
what would be the cumulative effect among nonwhite males on esti- 
mated errors in 1950 if deaths are assumed to be 5 per cent more 
numerous than the registered figures. If the general belief that deaths 
are more fully registered than births is correct, our estimated errors are 
not subject to serious question because of death registration. 
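The arithmetic behind this statement is simple enough to write out. The sketch below assumes that "underregistration by 10 per cent" means that one true death in ten goes unregistered; the function name and the 8 per cent attrition figure are hypothetical.

```python
def survivor_overstatement(true_attrition, underregistration):
    """Per cent by which a surviving cohort is overstated when a fraction
    `underregistration` of the true deaths is never registered."""
    true_survivors = 1.0 - true_attrition
    estimated_survivors = 1.0 - true_attrition * (1.0 - underregistration)
    return 100.0 * (estimated_survivors - true_survivors) / true_survivors


# A cohort losing 8 per cent of its members, with one death in ten unregistered.
print(round(survivor_overstatement(0.08, 0.10), 2))
```

Because attrition is small at the younger ages, even a sizable proportional error in deaths moves the estimated cohort by less than 1 per cent in this example.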


TABLE 3

APPARENT UNDERCOUNTS IN THE 1950 CENSUS AS A PER CENT
OF THE CENSUS POPULATION FOR NONWHITE MALES, ASSUMING
THAT THE 1930 CENSUS WAS SUBJECT TO THE PER CENT
ERROR IN 1940 OR 1950, WHICHEVER WAS LESS, AND 5 PER
CENT UNDERREGISTRATION OF DEATHS

                         0-4   5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64   65+
No Underregistration      11    12     7    18    19    27    27    20    18    20    11    33    18    29
5% Underregistration      11    11     6    18    19    26    25    18    16    18     7    28    11    18

The magnitude of errors introduced into our estimates by defective 
migration data can only be guessed at. Unrecorded immigration would 
produce an underestimate of the correct population; unrecorded emi- 
gration an overestimate. Some impression of the effect of plausible varia- 
tions in the estimation of cohort change arising primarily from different 
treatment of migration data can be derived from Table 4. In both parts 
of Table 4 errors in the 1950 census are calculated from the same as- 
sumption about error patterns—namely, that the 1930 census was sub- 
ject to the lesser of the errors in 1940 and 1950. The two sets of esti- 
mates differ in that they depend on different estimates of cohort 
changes between 1940 and 1950. Both estimates were derived by the 
Bureau of the Census.⁵ The mortality figures used were the same in
both instances (except for a different estimate of military deaths in 
World War II). The principal differences arise from different estimates
of net migration, including different assumptions about the age, sex,
and color of non-alien migrants. Another source of difference is that the
first estimates of cohort change were compiled with the benefit of more
complete information relating to the armed services overseas in 1950.

5 One set of estimates is from U. S. Bureau of the Census. Current Population Reports, Series P-25,
No. 98, August 1954. The other is derived by interpolation between an estimate for July 1, 1949 and
July 1, 1950. Current Population Reports, Series P-25, No. 39, and Statistical Abstract of the U. S., 1951,
p. 10.


TABLE 4

ESTIMATED ERRORS IN THE 1950 CENSUS AS A PER CENT OF
THE CENSUS POPULATION, BY SEX, COLOR, AND 5-YEAR AGE
GROUPS ACCORDING TO TWO ESTIMATES OF COHORT
CHANGES BETWEEN 1940 AND 1950

         Cohort changes from           Cohort changes interpolated from
         Current Population Reports,   Current Population Reports,
         Series P-25, No. 98           Series P-25, No. 39 and Statistical
                                       Abstract of the U. S., 1951
Age         WM    WF   NWM   NWF         WM    WF   NWM   NWF

0-4        4.5   3.8    11    10         4.5   3.8    11    10
5-9        3.1   2.5    12    10         3.1   2.5    12    10
10-14      1.1   1.1     7     7         2.2   2.2     7     7
15-19      2.6   1.5    18    12         2.4   1.5    18    12
20-24      2.3    .9    19     6         3.0   1.1    17     5
25-29      4.4   1.5    27     9         3.5   1.3    20     8
30-34      4.1   -.3    27     4         2.4   0.0    16     4
35-39      1.9  -1.0    20     5         1.3   -.8    14     5
40-44      -.2   1.8    18    10         1.5   [?]    15    10
45-49      2.1   2.7    20    16         1.8   2.9    16    15
50-54      -.9   -.5    11     1         -.7   -.1     1     1
55-59      5.8   8.4    33    40         5.6   8.5    25    40
60-64     -1.0   2.5    18    15         -.6   2.8     7    14
65-69      -.6    .9    15     0         -.4   1.2    16     4
70-74     -2.3    .3    34    -1        -1.9   -.2    33    10
75+      -11.1  -9.1    47    20       -13.7 -10.3     0    -1
Total      2.0   1.4    18     9         1.8   1.5    14     9

[?] Entry illegible in the source copy.

In both sets of estimates it is assumed that errors in the 1930 census were equal to those of 1940 
or 1950, whichever was less. 


Sex and Age Ratios for the Corrected Population 


In Figures 1 and 3 we subjected the 1950 census to tests of internal 
consistency and found that sex and age ratios deviated from plausible 
values. We will now put populations corrected by the method of similar 
errors and cohort change to the same tests. We will test several differ- 
ent sets of corrections, which will be labeled A, A’, B, and B’. Each set 
of corrections is derived from the assumption that the 1930 census was 
subject to errors equal to the smaller of the errors in 1940 and 1950. 
The difference between the primed and unprimed corrections is that
the primed corrections allow for changes in age-heaping, while the un-
primed do not. The A corrections use the estimated cohort changes for
1940-1950 published in 1954; the B corrections use earlier estimates of
cohort change. The following diagram summarizes the different bases
for the four sets of corrections:

                                      No allowance for       Allowance for
                                      changes in             changes in
                                      age-heaping            age-heaping
Cohort changes published in 1954            A                     A’
Earlier estimates of cohort change          B                     B’

Fig. 5. Males per female by five-year age groups, native whites and nonwhites
(including persons living abroad) according to the 1950 Census of Population,
according to corrections A and B, and according to a cohort life table.

Figure 5 shows sex ratios among the native white and nonwhite
populations in 1950 according to the census, as calculated from life
tables, and according to corrections A and B.⁶ Among the white popula-
tion either correction A or correction B produces a more plausible set
of sex ratios from age 15 to age 40 than does the census. Above age 40,
the corrected sex ratios seem consistently too low. Among nonwhites,
the sex ratios resulting from the B corrections are a striking improve-
ment over the census, and are also much less erratic than the sex ratios
resulting from the A corrections.⁷ This superiority leads us to prefer the
B corrections for the nonwhite population.
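The test applied here can be illustrated with a short sketch. The per cent errors used (4.4 for white males and 1.5 for white females at ages 25-29) are the Table 2 entries for 1950; the census counts are invented, and the function names are hypothetical.

```python
def corrected(census_count, error_pct):
    """Apply an estimated error, stated as a per cent of the census count
    (positive = undercount), to obtain a corrected count."""
    return census_count * (1 + error_pct / 100.0)


def sex_ratio(males, females):
    """Males per female."""
    return males / females


# Invented census counts (thousands) for one five-year age group.
census_males, census_females = 960.0, 1000.0
before = sex_ratio(census_males, census_females)
after = sex_ratio(corrected(census_males, 4.4), corrected(census_females, 1.5))
print(round(before, 3), round(after, 3))
```

Because the estimated undercount is larger for males than for females at this age, the correction raises the sex ratio, which is the direction of adjustment the text describes for young adults.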

Figure 6a shows age ratios for 1950, divided by adjusted cohort age 
ratios for the census, and for corrections A and A’, while Figure 6b 
shows the same ratios for corrections B and B’. The significant fact to 
be observed in this figure is that a crude assumption of similar errors in 
1930—i.e., either corrections A or B—leads to age ratios which are less 
credible than those in the census itself. The age ratios in the population 
corrected by methods A or B deviate from unity just as far as in the 
census and exhibit a regular sawtooth pattern which shows the existence 
of some kind of systematic error. However, when either set of correc- 
tions is altered by assuming that the 1930 census was subject to the 
error in a later census modified by calculated differences in age-heaping, 
a set of age ratios closer to unity and essentially free of a regular saw-
tooth character is produced.⁸

Among white males the best age ratios result from the B’ rather than 
the A’ corrections, while among nonwhite males the A’ corrections re- 
sult in slightly better age ratios. 

In all instances but one, the age ratios resulting from the refined 
hypothesis of similar errors (the primed corrections) are a clear im- 
provement on the census. This exception is method A’ among the white 
males. However, for white males the B’ corrections give age ratios 
which are typically closer to unity than census ratios. 





6 In preparing Figure 5 it was assumed that the native white and native nonwhite populations were
subject to the same undercounts and overcounts as the total white and nonwhite populations.

7 The very high sex ratios at ages 25-29 and 30-34 in the A corrections are the result of the large 
number of nonwhite male net immigrants estimated for those cohorts in the Current Population Reports, 
Series P-25, No. 98. Net immigration between 1940 and 1950 of nonwhite males who were 30-34 in 
1950 is estimated as 69 thousand. There are figures in the 1950 census which cast some doubt on the 
plausibility of immigration this extensive. The sum, in the 1950 census, of foreign-born nonwhites, plus 
nonwhite Puerto Ricans, plus Filipinos, at ages 30-34 (which would include the major portion of net 
nonwhite immigrants whenever they entered the country) is less than 15 thousand. 

8 It should be noted that it is an adjustment to the 1930 census correction at age 35-39 which has 
the effect of producing sensible age ratios at age 55-59 in the corrected population, and also that the 
adjustments are calculated from census data alone without reference to the corrections themselves. In
other words, there was nothing in the adjustments employed in methods A’ and B’ which automatically 
insured an improvement in age ratios. 
Fig. 6a. Age ratios in 1950 divided by adjusted age ratios for each cohort
for the white and nonwhite populations, according to the Census of Population,
and according to corrections A and A’.

Fig. 6b. Age ratios in 1950 divided by adjusted age ratios for each cohort
for the white and nonwhite populations, according to the Census of Population,
and according to corrections B and B’.

Comparisons of Census Errors Estimated by the Method of Similar Pat-
terns with Other Estimates


Table 5 shows the undercount of males of draft age in 1940 (21-35) 
compared with the undercount calculated for males 20-34 by various 
versions of the method of similar error patterns. 


TABLE 5

UNDERCOUNT AMONG MALES OF DRAFT AGE IN 1940 BY
COLOR AS A PER CENT OF CENSUS POPULATION
ACCORDING TO:

                          Various versions of similar error hypothesis

            Selective   Errors in 1930   Errors in 1940   Errors in 1930 equal to
            Service     equal to         equal to         those in 1940 or 1950,
            figures     average of       those in         whichever is less
                        1940 and 1950    1950                A          B

White          4.5         [?]              2.3             4.9        4.3
Nonwhite       18          [?]              26              23         20
Total          5.8         [?]              4.7             6.7        5.8

[?] Entry illegible in the source copy.

Sources: Estimated errors in the 1940 census for males 21-35 on the basis of Selective Service data 
taken from A. Ross Eckler and Leon Pritzker, “Measuring the accuracy of enumerative surveys” 
(paper presented at the 27th Session of the International Statistical Institute, New Delhi, 1951). 
U. S. Bureau of the Census (processed), p. 2.


Any of the methods using an assumption of similar error patterns 
with the exception of that which assumes 1940 and 1950 to be alike 
shows errors of roughly the same magnitude as do the selective service 
data. This comparison serves to reinforce the plausibility of a large 
undercount among young adult males. Young males are among the 
most mobile members of our society and the census methods of enumer- 
ation may well be particularly deficient in counting mobile persons. 

In Table 6 errors based on various versions of the similar error 
method are compared with errors derived from the Post Enumeration 
Survey. With regard to the white population, our methods yield much 
larger errors at the younger ages than does the PES, while at the oldest 
ages the PES indicates an undercount in contrast to the apparent over- 
count shown by the other methods. Among nonwhites our methods 
yield estimated errors uniformly larger than those of the PES. 
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It is clear from Table 6 that our methods and the Post Enumeration 
Survey can scarcely be considered as mutually confirming. The differ- 
ences between them can for the most part be explained, however, by a 
conjecture to the effect that persons of certain types who were missed 
by the census were also missed in the Survey.⁹ This conjecture seems
plausible enough in view of the similarities between census and Survey 
procedures. Although the Survey involved a careful recanvass of a 
sample of areas to obtain an estimate of households completely omitted, 
plus a reinterviewing in a sample of households to estimate the errone- 
ous inclusions and omissions within households; although the Survey 
was conducted only by enumerators chosen after aptitude and per- 
formance tests from among the regular census enumerators; and al- 
though these enumerators were given intensive supplementary training, 
nevertheless it is readily imaginable that, for example, persons who are 
mobile and relatively rootless would be missed by both canvasses. The 
regular Post Enumeration Survey did not attempt to cover the popula-
tion residing in dwelling places occupied by more than 35 persons;¹⁰
moreover, because the Survey was conducted 4 to 6 months after the 
census, it could hardly serve as an adequate check on transients. And 
if there were any tendency for nonwhite respondents to be reluctant 
(for example) to give information to strangers about (in particular) 
male members of the household, such reluctance would doubtless be as 
strong in a Post Enumeration Survey as in a census. In other words, 
the Post Enumeration Survey might be expected to reveal many 
(though not all) of the errors which could be attributed to poor training 
and gross carelessness on the part of the enumerators, and also many 
of the errors in reporting personal characteristics which are a result of 
obtaining information from a respondent other than the person enumer- 
ated. However, many errors caused by faulty memory or deliberate 
concealment, and omissions caused by such factors as impermanent 
residence would not be revealed by the Post Enumeration Survey. 


THE ESTIMATION OF ERRORS AT THE OLDER AGES 


The least reliable results of our method of estimating errors are at 
the older ages. These results may be distrusted on two grounds: first, 





9 This possibility is suggested on page 7 in the introduction to Volume II of the 1950 Census of
Population [9]. 

10 There was a “T-Night” coverage check on the enumeration of places patronized by certain 
transients conducted by the Bureau of the Census, but it could not parallel the regular Post Enumera- 
tion Survey procedure. The results of the coverage check have not been published; but it appears from 
preliminary figures that the check, when completed, would not more than begin to account for the dis- 
crepancies in Table 5. 





TABLE 6 


APPARENT ERRORS IN THE 1950 CENSUS AS A PER CENT OF 
THE CENSUS POPULATION, BY COLOR, SEX, AND BROAD AGE 
GROUPS, ACCORDING TO VARIOUS METHODS 
OF ESTIMATING ERRORS 








Method of estimating error | Color and sex | Age 0-24 | Age 25-44 | Age 45-64





The Post Enumeration 
Survey White 
M 
F 
Nonwhite 
M 
F 


Version of Similar Error 
Hypothesis 
Errors in 1930 equal White 
to average of 1940 and M 
1950 errors F 
Nonwhite 
M 
F 


Errors in 1940 equal White 
to those in 1950 M 

F 
Nonwhite 

M 

F 


Errors in 1930 equal 
to those in 1940 or 
1950, whichever is less 
A White 
M 
F 
Nonwhite 
M 
F 
White 
M 
F 
Nonwhite 
M 
F 
White 
M 
F 
Nonwhite 
M 
F 16 
White 
M . ;  F 
F i 4. 
Nonwhite 
M 13 15 
F 10 9 21 























Source: Errors from the Post Enumeration Survey are adapted from preliminary unpublished 
results of the Survey. The final census report on the Survey has not been published. The final figures 
the Census Bureau derives from the Survey may differ from these unofficial data. 
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because some of the results themselves do not seem plausible, and sec- 
ond, because the method is inherently weakest at the older ages. 

The features of the estimates for older persons which seem least 
credible are: 

1. The apparent overcount of over 2 per cent for whites over 65. 

2. The sex ratio for nonwhites over 65 which ranges from .945 for 
method B’ to .986 for method A, whereas the sex ratio over 65 derived 
from life-table values is only .857. 

The overcount among whites over 65 seems unreasonable in part be- 
cause the PES shows an undercount of about 2 per cent among older 
whites. While one can readily understand that the PES would err by 
missing persons also missed in the census, it is hard to see how the 
Survey would find a fictitious undercount. 

The inherent weakness at the older ages of our method of estimating 
errors is that the recursive technique employed leads to the possibility 
of cumulating at the older ages the errors introduced by spurious data 
or inaccurate assumptions. Thus the systematically greater deficits 
among whites we have detected in the 1940 census lead to especially 
bad results in the older ages. Also a fixed underregistration of deaths 
would have its greatest effect on estimates for the older ages. 

In Appendix II another possible explanation of the deficiencies in the 
older age error estimates is explored. This explanation is that the com- 
pleteness of enumeration at the ages above 50 has improved since 1930. 
Our method assumes that the 1930 census was just as complete as the 
1950 census; if in fact the 1930 census was less complete at older ages, 
the correct count in 1950 would be underestimated by our method. 

Because of the weaknesses in the error estimates for older ages by our 
method, we will in our final estimate of the errors in the 1950 census 
use the Post Enumeration Survey figures for persons over 65. 


FINAL ESTIMATE OF ERRORS IN THE 1950 CENSUS 


The figures we have selected as the best estimate of the 1950 popula- 
tion are presented in Table 7. For persons under 65, we have taken our 
figures from method B’. The Post Enumeration Survey results have 
been used for persons over 65. 

These particular corrections have been chosen because they are bet- 
ter in certain details (such as age and sex ratios) than other corrections 
generated by other versions of our method. We have good reason for 
rejecting the hypothesis that 1940 errors were at the same level as 
1950, but there is no real rationale for assuming minimal errors in 1930. 
Also—as already stated—inaccurate estimates of cohort change may 











44 AMERICAN STATISTICAL ASSOCIATION JOURNAL, MARCH 1955 


have made our corrections wide of the mark. Nevertheless we believe 
that our estimates give the right order of magnitude to the errors in the 
1950 census. 

The corrections we have selected indicate that there were about 5 
million persons more in the United States in 1950 than were counted by 


TABLE 7 


THE TOTAL POPULATION RESIDING IN THE U. S. AS OF APRIL 1,
1950, BY AGE, COLOR, AND SEX, ACCORDING TO THE 
CENSUS AND ACCORDING TO ESTIMATED 
CORRECTIONS 


(In thousands) 



































Age    |      White males      |     White females     |    Nonwhite males     |   Nonwhite females
       | Census   Corr.  Error | Census   Corr.  Error | Census   Corr.  Error | Census   Corr.  Error
       |  count    pop.   as % |  count    pop.   as % |  count    pop.   as % |  count    pop.   as %

0-4    |  7,244   7,570    4.5 |  6,940   7,200    3.8 |    992   1,100     11 |    987   1,090     10
5-9    |  5,915   6,100    3.1 |  5,681   5,820    2.5 |    799     890     12 |    804     880     10
10-14  |  4,945   5,000    1.1 |  4,750   4,800    1.1 |    716     760      7 |    709     760      7
15-19  |  4,686   4,800    2.4 |  4,645   4,720    1.5 |    626     740     18 |    661     740     12
20-24  |  5,003   5,190    3.7 |  5,176   5,270    1.8 |    604     720     19 |    699     760      8
25-29  |  5,350   5,540    3.5 |  5,575   5,640    1.3 |    622     750     20 |    695     750      8
30-34  |  5,081   5,280    3.8 |  5,276   5,350    1.3 |    544     660     20 |    617     680      9
35-39  |  4,956   5,030    1.5 |  5,103   5,050   -1.0 |    562     630     12 |    626     650      3
40-44  |  4,574   4,680    2.3 |  4,617   4,750    2.9 |    497     600     20 |    517     610     18
45-49  |  4,080   4,140    1.5 |  4,089   4,200    2.7 |    446     500     13 |    455     510     12
50-54  |  3,756   3,800    1.3 |  3,779   3,860    2.1 |    373     420     11 |    364     420     16
55-59  |  3,351   3,500    4.6 |  3,345   3,600    7.6 |    279     330     17 |    260     340     30
60-64  |  2,829   2,880    1.9 |   2,8"   2,980    5.4 |    208     260     24 |    198     270     36
65+    |  5,360   5,470    2.0 |  6,014   6,120    1.8 |    437     490     12 |    459     480      5
Total  | 67,129  68,980    2.7 | 67,813  69,360    2.3 |  7,704   8,830     15 |  8,051   8,930     11

Error as %: error as a per cent of the census count.


























Grand Total Census 150,697. 
Grand Total Corr. Pop. 156,100. 
Error as % of Census Pop. 3.6. 
Source: Corrections for ages 0-64 are estimated by the method described in the text as B’. For persons over 65, the 
corrections are adapted from unpublished preliminary results of the Post Enumeration Survey. 
The corrected population is rounded to the nearest 10,000. 


the census. The number of males was understated by about 3 million 
and the number of females by about 2 million. The white population 
was apparently subject to an undercount of nearly 2.5 per cent and the 
nonwhite population to an undercount between 12 and 13 per cent. 
The largest understatements of numbers were among young children, 
young adult males, and persons aged 55-59. 
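
The summary figures in this paragraph can be recomputed from the totals row of Table 7. The Python sketch below is an illustrative check only, not part of the original analysis; the function and variable names are assumptions introduced here, and the inputs are the Table 7 totals in thousands.

```python
# Totals row of Table 7, in thousands: (census count, corrected population).
totals = {
    ("white", "male"): (67_129, 68_980),
    ("white", "female"): (67_813, 69_360),
    ("nonwhite", "male"): (7_704, 8_830),
    ("nonwhite", "female"): (8_051, 8_930),
}


def deficit(groups):
    """Corrected population minus census count, summed over groups (thousands)."""
    return sum(totals[g][1] - totals[g][0] for g in groups)


def pct_error(groups):
    """Deficit as a per cent of the census count for the given groups."""
    census = sum(totals[g][0] for g in groups)
    return 100.0 * deficit(groups) / census


males = [g for g in totals if g[1] == "male"]
females = [g for g in totals if g[1] == "female"]
whites = [g for g in totals if g[0] == "white"]
nonwhites = [g for g in totals if g[0] == "nonwhite"]

print(deficit(males), deficit(females))                              # 2977 2426
print(round(pct_error(whites), 1), round(pct_error(nonwhites), 1))   # 2.5 12.7
print(round(pct_error(list(totals)), 1))                             # 3.6
```

On these totals the deficit is about 3.0 million males and 2.4 million females, which the text rounds to 3 million and 2 million.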
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It must be emphasized that the apparent deficits (and overstate- 
ments) presented in Table 6 are a result of the combined effects of 
omission and misclassification. In other words, a table listing apparent 
deficits for the various age-sex-color groups does not tell us character- 
istics of persons omitted by census takers. It appears, for example, that 
persons 55-59 were omitted in especially large numbers. A more likely 
explanation for the large deficit at ages 55-59 is that there is a general 
tendency to underestimate or overestimate the age of persons 55 to 59 
years old. Methods of detecting errors which compare gross totals 
rather than individual records—the method we have employed is of this 
sort—fail to reveal directly whether a given deficit arises from omission 
or misclassification. 
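
A small numerical sketch may make the point concrete. The counts below are hypothetical, chosen only for illustration: nobody is omitted, but some persons aged 55-59 are reported in the adjacent age groups, and a comparison of gross totals then shows a deficit at 55-59.

```python
# Hypothetical true counts by age group (no one is omitted from the census).
true_counts = {"50-54": 1000, "55-59": 900, "60-64": 800}

# Assumed age misstatement: some persons aged 55-59 are reported as younger,
# and some as older.
reported_younger = 30
reported_older = 40

reported = dict(true_counts)
reported["55-59"] -= reported_younger + reported_older
reported["50-54"] += reported_younger
reported["60-64"] += reported_older

# A comparison of gross totals sees only these apparent deficits.
apparent_deficit = {age: true_counts[age] - reported[age] for age in true_counts}
print(apparent_deficit)
print(sum(apparent_deficit.values()))  # net deficit over all ages
```

The deficit at 55-59 is offset by surpluses in the neighboring groups, so the net deficit is zero even though the age-specific table suggests omissions.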

The large apparent deficits among nonwhites pose an interesting 
question: are these deficits the result of omissions of nonwhites or the 
result of misclassification of some sort? In this instance “misclassifica- 
tion” is perhaps an inexact term, since there are no clearly correct 
standard categories of white and nonwhite. The judgment of race 
(“RACE, White, Negro, American Indian, Japanese, Chinese, Filipino, 
Other race—spell out” is the heading on the enumerator’s form) in the 
census was left to the enumerator. He was not instructed to ask the 
race of the respondent. He was told to assume the race of related per- 
sons in the same household to be the same as that of the respondent; 
and he was instructed to ask about race only for unrelated persons. 
Hence white persons are persons classified as white by a census enumer- 
ator, and nonwhite persons are persons classified as Negro, American 
Indian, Japanese, Chinese, or Filipino by the enumerator. Thus “mis- 
classification” with respect to color might better be termed “inconsist- 
ent classification” of the same person as between one census and 
another, as between a birth certificate and a census form, or as between 
the entry made by a Selective Service Board and by a census enumera- 
tor. 

It is clearly possible for a person to be classed as nonwhite in one 
census and as white in later censuses. This change might arise because 
of a reduction in segregated neighborhoods, or because of heavy migra- 
tion from the South. A person classified by an enumerator as nonwhite 
while living in a segregated neighborhood might well be classified sub- 
sequently as white when living elsewhere. Similarly, persons designated 
as Negro in the South might later be classified as white by a northern 
enumerator. 

These observations on the possibility that persons may be reclassified 
from nonwhite to white from one census to another should not be con- 
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strued as equivalent to observations on the phenomenon known as 
“passing.” Changes in classification in successive censuses would reflect 
nothing more than different judgments by census enumerators, and 
have no necessary connection with the classification generally made in 
the community, or even with the classification each person would make 
for himself. Our estimate of the error in counting the total population is 
not to any important degree affected by whether or not there have been 
inconsistencies in color classification from census to census. For if some 
nonwhite persons were missing because of reclassification, then the 
population designated “white” in the census must have gained recruits 
by the same reclassification. Hence for every missing nonwhite that we 
explain by reclassification, there is a newly missing white that can be 
explained only by omission. 


SIGNIFICANCE OF THE ERRORS 


The importance of having a correct enumeration of the population 
depends, of course, on what use one has in mind for the numbers. It is 
obviously impractical to attempt a review of all of the uses of census 
population data, and to try to comment on the significance of errors in 
each use. We will limit our remarks to a few comments on the kinds of 
conclusions likely to be affected and the kinds that are not. 

The error of over 3 per cent in the total number of persons would 
only seldom have important consequences. People given to referring to 
163 million Americans would be more accurate if they spoke (as of the 
beginning of 1955) of 168 million Americans, but their conclusions 
would rarely be affected. 

A common use of the total number of persons is in deriving per capita 
figures—from birth and death rates to average number of movies at- 
tended. Presumably an understatement of the population total by 3 per 
cent will cause such ratios to be 3 per cent too large. However, in very 
many instances if the truth were known the unassessed error in the 
numerator of per capita figures would exceed 3 per cent. We can give 
one example where the numerator has been subject to verification— 
namely, the birth rate, where the numerator is the number of births. 
The preliminary crude birth rate for the United States in 1950—the one 
listed in such authoritative places as the 1950 Demographic Yearbook of 
the United Nations—is derived from census populations and registered 
births. The underregistration of births according to the test conducted 
in conjunction with the 1950 census was a little over 2 per cent [11]. 
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Thus the birth rate is subject to two nearly equal (and, it so happens, 
compensating) errors.¹¹
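
The compensation can be illustrated with a short calculation. This is a sketch under stated assumptions, not a figure from the original study: it takes registered births to be 98 per cent of actual births (the text says underregistration was "a little over 2 per cent") and takes the true population to exceed the census count by 3.6 per cent (the grand total error in Table 7). The function name is introduced here for illustration.

```python
# Assumed inputs (illustrative only).
birth_underregistration = 0.02   # share of births not registered
census_undercount = 0.036        # deficit as a fraction of the census count


def ratio_to_true_rate(birth_completeness, pop_deficit):
    """Reported per capita rate divided by the true rate.

    reported = (true births * completeness) / (true population / (1 + deficit))
    """
    return birth_completeness * (1.0 + pop_deficit)


# Preliminary rate: registered births over the census population.
preliminary = ratio_to_true_rate(1.0 - birth_underregistration, census_undercount)

# Rate using births adjusted for underregistration, still over the census population.
adjusted_births_only = ratio_to_true_rate(1.0, census_undercount)

print(round(preliminary, 4), round(adjusted_births_only, 4))
```

Under these assumptions the preliminary rate is about 1.5 per cent too high, while the rate based on adjusted births alone is about 3.6 per cent too high.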

The errors in the population count more apt to be consequential are 
those for certain groups, particularly when they reach levels several 
times the error in the total. For example, if the corrections we have 
derived are proper, and if deaths are equally well registered for males 
and females, part of the difference between male and female mortality 
rates is spurious. If we adjust the 1950 life table for underenumeration 
of the base population, the difference in expectation of life at age 5 be-
tween white males and white females becomes 5.4 instead of 5.5 years,¹²
and the difference in $\mathring{e}_5$ between nonwhite males and females becomes
2.6 instead of 3.4 years. 

The most striking errors—and the ones which seem most likely to 
“make a difference’’—are the rates of underenumeration among non- 
whites. If we interpret these as representing nonwhite persons omitted, 
we are led to such views as that (for instance) the sections of the 
country with a high per cent of nonwhites contain a larger fraction of 
the total population than the census shows. Assuming that nonwhite 
deaths are as completely registered as white, we would conclude that 
part of the difference in white and nonwhite death rates—to return to 
mortality statistics—is spurious. The corrected expectation of life at 
age 5 for nonwhite males is 4.2 years rather than 6 years less than that 
for white males. 

Finally, the existence of errors in the census could lead to inappropri- 





11 Thus, ironically enough, the preliminary birth rate based on registered births (and the census
population) is probably closer to reality than the birth rate based on adjusted birth figures. 

12 Life table taken from U. S. National Office of Vital Statistics [10].

The effect on expectation of life of correcting the size of population is small because the large cor- 
rections occur at ages (e.g., 20-34) where mortality rates are low. It is interesting to note that the differ- 
entials which result from correcting the life table for errors in enumeration are closer to the differentials 
among the industrial policy holders of the Metropolitan Life Insurance Company. The mortality ex- 
perience of the industrial policy holders is no doubt non-representative of the U.S. population in various 
ways. However, one would expect the enumeration of these policy holders by the Metropolitan to be 
relatively accurate.


DIFFERENCES IN $\mathring{e}_5$

                                              White females   Nonwhite females   White males
Life table                                    minus           minus              minus
                                              white males     nonwhite males     nonwhite males

U. S., 1950                                   5.5             3.4                6.0
U. S., 1950, corrected for underenumeration   5.4             2.6                4.2
Metropolitan Life Insurance Co.
Industrial Policy Holders, 1950               5.3             2.4                2.6














Data on industrial policy holders are taken from Metropolitan Life Insurance Company [3]. 
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ate “blow-ups” of sample figures in estimating national totals—inap- 
propriate because the total population used is too small, and inap- 
propriate because the relative size of age-sex-color groups is misappre- 
hended. Only infrequently, I suspect, would errors from this source be 
an important factor compared to sampling variability and the errors of 
measurement within the sample. 
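
The two sources of error in such a "blow-up" can be seen in a small sketch. The group totals are the Table 7 totals (in thousands); the sample rates are hypothetical values invented for illustration, and the names are assumptions introduced here.

```python
# Table 7 totals, in thousands, by color and sex.
census = {"white_m": 67_129, "white_f": 67_813, "nonwhite_m": 7_704, "nonwhite_f": 8_051}
corrected = {"white_m": 68_980, "white_f": 69_360, "nonwhite_m": 8_830, "nonwhite_f": 8_930}

# Hypothetical sample rates (share of each group having some characteristic).
sample_rate = {"white_m": 0.10, "white_f": 0.08, "nonwhite_m": 0.20, "nonwhite_f": 0.15}


def blow_up(population):
    """National total estimated by applying sample rates to group populations."""
    return sum(population[g] * sample_rate[g] for g in population)


understatement = blow_up(corrected) / blow_up(census) - 1.0
overall_error = sum(corrected.values()) / sum(census.values()) - 1.0

print(round(100 * understatement, 1), round(100 * overall_error, 1))
```

Because the hypothetical rates are highest in the groups with the largest undercounts, the estimated total is understated by more than the overall census error.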


* * *


The results of a discussion which has been rather complicated in 
places can be summarized in a few sentences. 

1. The corrected total of 156 million persons is closer to the number 
of persons in the United States in April 1950 than the census figure of 
about 151 million. 

2. It is our impression after surveying the evidence that the esti- 
mated deficit of about 5 million is, if anything, somewhat conservative. 
However, if we made less conservative assumptions—assuming average 
rather than minimal 1940-1950 errors in 1930—the estimated deficit 
would be increased by no more than 10 per cent. 

3. The general pattern of age-by-age errors indicated in Table 6 is 
probably reliable, though the results seem less trustworthy at the higher 
ages. 

4. There is a possibility that part of the large apparent deficit among 
the nonwhite population is caused by inconsistencies in classifying per- 
sons by color. If such were known to be the case, the estimated total 
error would hardly be affected, but the estimated number of nonwhite 
persons omitted would be reduced, and the estimated number of white 
persons missed would be increased. 

5. The importance of census errors can be judged only when it is 
known how the figures are used. It seems likely, however, that the dif- 
ferential errors for various subgroups will be more important than the 
error in the total. 

A final comment about the difficulties of improving the accuracy of 
the census count may be in order. The census of population obtains a 
large amount of information beyond the mere count of persons. As a 
general rule, methods which would improve the accuracy of counting 
would cost more money, and within a fixed budget the higher costs of 
accuracy would mean less information gathered. It would be costly to 
extend the training of enumerators and only a moderate improvement 
would result (the meagerness of the improvement is implied by the 
small errors discovered by the Post Enumeration Survey). A large re- 
duction in errors would probably require a drastic (and no doubt ex- 
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pensive) change in procedure, such as adding a thorough de facto can- 
vass to the present de jure type of census and cross checking the results 
name by name. The institution and subsequent improvement within 
the past twenty-five years of nationwide vital statistics have made it 
possible for the first time to verify census accuracy. With such a verifi- 
cation now possible—this article is a crude example—greater accuracy 
in determining numbers of persons may become available at moderate 
cost. 


APPENDIX I 


ESTIMATING DIFFERENCES IN 5-YEAR AGE-HEAPING 
IN THE 1930, 1940, AND 1950 CENSUSES 


The job at hand is to estimate the surpluses and deficits in five-year 
age groups in the 1930, 1940, and 1950 censuses attributable to “age- 
heaping.” The general principle of the method we use is to ascertain the 
ratio of the number in a given age group to the number expected from a 
“smoothed” age distribution; to ascertain the ratio of the average num- 
ber in the cohort as enumerated in all censuses to the number expected 
from a “smoothed” distribution of such averages; and to divide the 
former ratio by the latter and consider the answer a reflection of age- 
heaping. In other words, an age group deviates from a “smoothed” 
value, we assume, because of age-heaping and because the cohort itself 
deviates from a smoothed value. The latter factor is minimized by using 
the ratio of a cohort average in all censuses to the smoothed value of 
such averages as a divisor. 

The actual procedure employed for each sex and color in each census 
was: 

(a) Determine the total deviation of each 5-year age group from a 
smoothed value, by dividing the number in each age group by 
the three-term moving average centered on the group. 

(b) Determine the first approximation to the deviation due to cohort 
size by dividing the average reported size of the cohort in all 
censuses in which it was enumerated by the three-term moving 
average of such cohort averages. 

(c) Since, however, the average reported size of a cohort may deviate 
from the smoothed value because of persistent age-heaping at 
every enumeration, the first approximation to the deviation due 
to cohort size was adjusted as follows: 

i. Manufacture a synthetic population consisting of the sum of 
the persons enumerated in each group in 1930, 1940, and 
1950. 
ii. Average the numbers in this synthetic population for the
ages at which each cohort enumerated in the three censuses
was enumerated in any census (e.g., ages 5-9, 15-19, and
25-29 for the cohort aged 25-29 in 1950).

iii. Divide these average numbers by a three-term moving aver-
age to estimate typical divergence from a smoothed value of
an average cohort size due to persistent age-heaping.

iv. The adjusted estimate of deviation due to cohort size was
then obtained by dividing the first approximation (see (b)
above) by the typical divergence due to persistent heaping
(see (c) iii above).

(d) Then the total deviation of each age group (see (a) above) was
divided by the adjusted estimate of deviation due to cohort size
(see (c) iv above). The result was considered a rough measure of
the extent of age-heaping.

The differences in age-heaping among the three censuses are shown
in Table 8.
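
The procedure in steps (a) through (d) can be sketched in Python for one sex and color group. This is a minimal reading of the steps, not the authors' computation: the data layout, the function names, the treatment of end groups (skipped when a neighbor is missing), and the choice to smooth cohort averages across adjacent cohorts are all assumptions made here.

```python
YEARS = (1930, 1940, 1950)


def moving_ratio(values):
    """Ratio of each value to the three-term moving average centered on it.

    `values` maps consecutive integer indices to numbers; an index without
    a neighbor on both sides is skipped.
    """
    return {
        i: 3.0 * v / (values[i - 1] + v + values[i + 1])
        for i, v in values.items()
        if i - 1 in values and i + 1 in values
    }


def cohort_index(year, group):
    """Label a cohort by the 5-year age-group index it would occupy in 1950."""
    return group + (1950 - year) // 5


def age_heaping(counts):
    """counts[year][g]: persons enumerated in 5-year age group g (0 = ages 0-4)."""
    n_groups = len(counts[1950])

    # (a) total deviation of each age group from a smoothed value
    total_dev = {y: moving_ratio(dict(enumerate(counts[y]))) for y in YEARS}

    # (year, group) pairs at which each cohort was enumerated
    seen = {}
    for y in YEARS:
        for g in range(n_groups):
            seen.setdefault(cohort_index(y, g), []).append((y, g))

    # (b) first approximation to the deviation due to cohort size
    cohort_avg = {k: sum(counts[y][g] for y, g in yg) / len(yg) for k, yg in seen.items()}
    first_approx = moving_ratio(cohort_avg)

    # (c) i-iii: typical divergence due to persistent age-heaping
    synthetic = [sum(counts[y][g] for y in YEARS) for g in range(n_groups)]
    synth_avg = {k: sum(synthetic[g] for _, g in yg) / len(yg) for k, yg in seen.items()}
    persistent = moving_ratio(synth_avg)

    # (c) iv: adjusted estimate of the deviation due to cohort size
    adjusted = {k: first_approx[k] / persistent[k] for k in first_approx}

    # (d) rough measure of age-heaping for each census and age group
    return {
        y: {
            g: dev / adjusted[cohort_index(y, g)]
            for g, dev in total_dev[y].items()
            if cohort_index(y, g) in adjusted
        }
        for y in YEARS
    }


# Hypothetical smooth age distributions, with ages 25-29 inflated in 1950 only.
example = {y: [1000.0 - 10.0 * g for g in range(15)] for y in YEARS}
example[1950][5] *= 1.05
print(age_heaping(example)[1950][5])
```

A value above 1 for a group indicates a surplus attributed to age-heaping; Table 8 reports differences in such measures between censuses.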


TABLE 8 


DIFFERENCES BETWEEN THE 1950 AND 1930 CENSUSES AND 
BETWEEN THE 1940 AND 1930 CENSUSES AS TO THE PER CENT 
ERROR IN REPORTING EACH AGE GROUP 
CAUSED BY AGE-HEAPING 








1950-1930 1940-1930 
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® At ages 10-14 and 15-19 the difference between the 1950 and 1930 censuses was assumed equal 
to twice the difference between the 1940 and 1930 censuses. 


APPENDIX II 


CHANGES IN THE COUNT OF OLDER PERSONS IN RECENT CENSUSES 


A conceivable source of weakness in our estimates of census errors in 
counting older persons is the possibility that the completeness of 
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enumeration in 1930 was not (as assumed) the same as in 1940 or 1950. 
As a means of determining whether and to what degree enumeration at 
the older ages has been changing, we can compare over several decades 
the fraction of various cohorts that appear on the basis of census data 
to survive a decade with the fraction apparently surviving on the basis 
of mortality data. The enumerated population aged 60 to 64 in 1930 
divided by the enumerated population 50 to 54 in 1920 indicates the 
apparent fraction of this cohort surviving the 1920’s, while the ratio of 
${}_5L_{60}$ to ${}_5L_{50}$ from a United States life table for the 1920’s gives the same
ratio derived from mortality data. But the census apparent survivor- 
ship is contaminated, so to speak, by migration; to the degree that a 
cohort gains or loses by international migration, there should be dis- 
agreement between census and life table ratios. The effect of migration 
can be mitigated, however, by forming the ratio of the survivorship of 
two adjacent cohorts. Such a ratio should be the same whether derived 
from census or mortality records, provided that, for example, a cohort 
aged 50-54 at the beginning of a decade loses or gains through migra- 
tion the same fraction as the cohort aged 55-59. The ratio of the appar- 
ent survivorship of adjacent cohorts derived from successive censuses 
is divided, in Table 9, by the same ratio derived from mortality data. 
Thus if we call $({}_5S_x)_c$ the fraction of a cohort aged $x$ to $x+5$ at the
beginning of a decade surviving 10 years on the basis of census data,
and $({}_5S_x)_m$ the fraction surviving on the basis of mortality data, the
ratio $R_x$ presented in Table 9 is given by:

$$R_x = \frac{({}_5S_{x+5})_c}{({}_5S_x)_c} \bigg/ \frac{({}_5S_{x+5})_m}{({}_5S_x)_m}$$
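
The ratio just defined can be computed directly from two censuses and a life table. The sketch below is illustrative only: the function name, the dictionary layout (keys are the lower age of each 5-year group), and the numbers are assumptions introduced here, not data from the study.

```python
def r_x(census_start, census_end, life_table_L, x):
    """Census ratio of adjacent-cohort survivorships over the life-table ratio.

    census_start[a], census_end[a]: enumerated population aged a to a+4 at the
    start and end of the decade; life_table_L[a]: life-table stationary
    population for ages a to a+4.
    """

    def census_survival(age):
        # Apparent fraction of the cohort aged `age` surviving the decade.
        return census_end[age + 10] / census_start[age]

    def life_table_survival(age):
        # Same fraction derived from mortality data.
        return life_table_L[age + 10] / life_table_L[age]

    census_ratio = census_survival(x + 5) / census_survival(x)
    mortality_ratio = life_table_survival(x + 5) / life_table_survival(x)
    return census_ratio / mortality_ratio


# Hypothetical life table and census counts that agree with it exactly.
L = {50: 400.0, 55: 370.0, 60: 330.0, 65: 280.0}
start = {50: 4000.0, 55: 3700.0}
end_exact = {60: 3300.0, 65: 2800.0}

# Same data, but the group aged 65-69 at the later census is 5 per cent short.
end_short = {60: 3300.0, 65: 2800.0 * 0.95}

print(r_x(start, end_exact, L, 50))
print(r_x(start, end_short, L, 50))
```

When the censuses agree with the life table the ratio is 1; a 5 per cent shortfall in the older cohort at the later census lowers it to 0.95. Migration that affects both cohorts in the same proportion cancels out of the ratio.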


This tabulation has a number of revealing features: 

1. $R_x$ diverges substantially from 1.00 in the United States, Canada,
England and Wales, and the Union of South Africa, but is within 1 per
cent of unity in Norway, Sweden, and Switzerland. These last countries
are known on other grounds to have accurate census data. The fact
that $R_x$ is so close to one in these countries, then, indicates that $R_x$
values do indeed reflect the presence of census errors (though of course
$R_x$ values other than unity could also be caused by erroneous mortality
data).

2. There are certain persistent relations among $R_x$ values for United
States whites, 1890 to 1950, for Canada, England and Wales, and the
Union of South Africa. $R_{50}$ with no exceptions is the largest ratio, and
either $R_{40}$ or $R_{55}$ is the lowest ratio with the exception only of England
and Wales.
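
Two of these features can be checked mechanically against Table 9. The sketch below is a reader's check, not part of the original analysis; the values are transcribed from Table 9 on the assumption that a leading hyphen or a missing point in the scanned figures stands for a decimal point, and the variable names are introduced here.

```python
# Each row: males (R40, R45, R50, R55) followed by females (R40, R45, R50, R55).
table9 = {
    ("U.S.A. whites", "1890-1900"): (0.917, 1.003, 1.125, 0.917, 0.958, 1.011, 1.110, 0.914),
    ("U.S.A. whites", "1900-1910"): (0.927, 1.004, 1.104, 0.967, 0.946, 1.036, 1.063, 0.971),
    ("U.S.A. whites", "1910-1920"): (0.931, 1.020, 1.072, 0.975, 0.951, 1.035, 1.065, 0.978),
    ("U.S.A. whites", "1920-1930"): (0.883, 1.039, 1.079, 0.981, 0.947, 1.025, 1.066, 0.980),
    ("U.S.A. whites", "1930-1940"): (0.959, 1.001, 1.090, 0.987, 0.964, 1.020, 1.098, 0.967),
    ("U.S.A. whites", "1940-1950"): (0.972, 1.005, 1.073, 0.978, 0.970, 1.005, 1.100, 0.956),
    ("Canada", "1931-1941"): (0.974, 1.008, 1.092, 0.985, 0.977, 1.016, 1.078, 1.000),
    ("England and Wales", "1921-1931"): (0.970, 1.005, 1.031, 1.007, 0.976, 1.016, 1.051, 1.018),
    ("Union of South Africa", "1921-1931"): (0.952, 1.045, 1.053, 0.958, 0.958, 1.025, 1.100, 0.941),
    ("Norway", "1920-1930"): (0.994, 1.008, 1.005, 1.004, 0.995, 0.998, 1.007, 1.009),
    ("Sweden", "1900-1910"): (1.003, 0.999, 1.000, 1.005, 0.997, 1.002, 0.996, 1.003),
    ("Sweden", "1910-1920"): (0.998, 0.999, 1.005, 1.008, 1.004, 0.996, 1.002, 1.008),
    ("Sweden", "1920-1930"): (1.005, 0.999, 1.003, 1.007, 1.001, 1.001, 1.001, 1.001),
    ("Sweden", "1930-1940"): (0.997, 0.999, 1.002, 0.998, 1.000, 0.998, 1.004, 0.995),
    ("Switzerland", "1920-1930"): (0.999, 0.999, 1.006, 0.999, 1.000, 1.002, 1.002, 1.000),
}

ACCURATE = {"Norway", "Sweden", "Switzerland"}

# Feature 1: largest departure from unity in the countries with accurate censuses.
max_dev_accurate = max(
    abs(value - 1.0)
    for (country, _), row in table9.items()
    if country in ACCURATE
    for value in row
)

# Feature 2: R50 (third entry of each half-row) is the largest ratio elsewhere.
r50_always_largest = all(
    max(half) == half[2]
    for (country, _), row in table9.items()
    if country not in ACCURATE
    for half in (row[:4], row[4:])
)

print(round(max_dev_accurate, 3), r50_always_largest)
```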
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TABLE 9 


$R_x$ FOR VARIOUS AGES, MALE AND FEMALE, IN
VARIOUS COUNTRIES

Country and decade       |            Males            |           Females
                         |     40     45     50     55 |     40     45     50     55

U. S. A., Whites
  1880-1890              |   .882  1.037  1.084   .958 |   .913  1.065  1.084   .941
  1890-1900              |   .917  1.003  1.125   .917 |   .958  1.011  1.110   .914
  1900-1910              |   .927  1.004  1.104   .967 |   .946  1.036  1.063   .971
  1910-1920              |   .931  1.020  1.072   .975 |   .951  1.035  1.065   .978
  1920-1930              |   .883  1.039  1.079   .981 |   .947  1.025  1.066   .980
  1930-1940              |   .959  1.001  1.090   .987 |   .964  1.020  1.098   .967
  1940-1950              |   .972  1.005  1.073   .978 |   .970  1.005  1.100   .956
U. S. A., Nonwhites
  1880-1890              |   .761  1.074  1.236   .738 |   .799  1.217  1.367   .775
  1890-1900              |   .726  1.130  1.168   .828 |   .769  1.269  1.233   .884
  1900-1910              |   .748  1.067  1.196   .808 |   .825  1.162  1.224   .856
  1910-1920              |   .714  1.064  1.192   .841 |   .768  1.220  1.215   .923
  1920-1930              |   .584  1.130  1.186   .821 |   .783  1.161  1.241   .862
  1930-1940              |   .821   .915  1.643   .827 |   .860  1.069  1.816   .753
  1940-1950              |   .921   .970  1.436   .864 |   .906  1.047  1.663   .779
Canada
  1931-1941              |   .974  1.008  1.092   .985 |   .977  1.016  1.078  1.000
England and Wales
  1921-1931              |   .970  1.005  1.031  1.007 |   .976  1.016  1.051  1.018
Union of South Africa
  1921-1931              |   .952  1.045  1.053   .958 |   .958  1.025  1.100   .941
Norway
  1920-1930              |   .994  1.008  1.005  1.004 |   .995   .998  1.007  1.009
Sweden
  1900-1910              |  1.003   .999  1.000  1.005 |   .997  1.002   .996  1.003
  1910-1920              |   .998   .999  1.005  1.008 |  1.004   .996  1.002  1.008
  1920-1930              |  1.005   .999  1.003  1.007 |  1.001  1.001  1.001  1.001
  1930-1940              |   .997   .999  1.002   .998 |  1.000   .998  1.004   .995
Switzerland
  1920-1930              |   .999   .999  1.006   .999 |  1.000  1.002  1.002  1.000
Sources: 


Population data for the United States. U. S. Bureau of the Census, U. S. Census of Population,
1950. Vol. II. Characteristics of the Population, Table 39. 

Life table data for the United States from official life tables for the United States, for 1901, 
1910, 1920-29, 1930-39, and 1945. 

Population data for foreign countries taken either from census publications (for the Union of 
South Africa, England and Wales, and Canada) or from statistical yearbooks (for Norway, Sweden, 
and Switzerland) of the country in question. 

Life table data for foreign countries taken from United Nations, Demographic Yearbook, 1948,
Tables 33 and 34.


$$R_x = \frac{({}_5S_{x+5})_c}{({}_5S_x)_c} \bigg/ \frac{({}_5S_{x+5})_m}{({}_5S_x)_m},$$

where ${}_5S_x$ represents the fraction of a cohort aged $x$ to $x+5$ at the beginning
of a decade that survives. $({}_5S_x)_c$ represents the fraction surviving according to census data,
$({}_5S_x)_m$ the fraction surviving according to life table data.



















POPULATION OF U.S., 1950--REVISION OF CENSUS FIGURES 53





COMPLETENESS OF ENUMERATION 
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Fig. 7. Apparent completeness of enumeration of persons between the ages 
of 55 and 75 in U. S. censuses from 1890 to 1950 (based on a comparison of 
census and life table survivorship ratios) by sex and color. 
The $R_x$ values do not, as they appear in Table 9, reveal the time pat-
tern of erroneous enumeration in the United States censuses. However,
if we assume uniform enumeration up to age 50-54 in all censuses, and
assume that life table values yield an accurate estimate of ratios of 
cohort survivorships, it becomes possible to calculate the apparent com- 
pleteness of enumeration of white males and white females at ages 55- 
59 through 70-74 for the censuses from 1900 to the present. 

The results of such calculations are presented in Figure 7. The level 
of underenumeration in Figure 7 cannot be taken seriously (being based 
on an assumption for white males of no error in enumeration in all 
censuses at ages 40-44 and 50-54, and of about 2 per cent under- 
enumeration at ages 45-49), but changes probably are significant. The 
fact that 1930 is shown to be more underenumerated than 1940 and 
1950 for all of the ages considered—and especially so for the males—is 
striking. 
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ELEMENTS OF SYMMETRY IN THE SKEWED 
INCOME CURVE* 


HERMAN P. MILLER 
Bureau of the Census 


The thesis of this paper is that the skewness of income dis- 
tributions is largely due to merging several symmetrical dis- 
tributions which differ primarily with respect to level and 
dispersion. Much of the skewness of the income distribution is 
due to the inclusion of women. Even for men, there is con- 
siderable symmetry in income distributions when occupational 
groups are considered individually. The “tail” of the income 
distribution largely includes men employed as independent 
professionals, businessmen, or managers. To the extent that 
there is freedom of entry into these occupations, income 
differences between these groups and others may merely repre- 
sent the payment by society for rare skills or services. The 
facts regarding freedom of entry are not now adequately 
known. 


NUMEROUS attempts have been made to construct theoretical systems
which would account for the skewness of the income curve.
These theories can be traced as far back as the latter part of the nine- 
teenth century, and they persist with full vigor to the present day. In 
general, theories of the distribution of personal income can be classi- 
fied as either “natural” or “institutional.” The natural theories gen- 
erally take a mathematical form and they attempt to explain income 
distribution in terms of models which are generated by the theory of 
probability. Although some of these theories take institutional factors 
into account, the great majority regard income distributions as the re- 
sult obtained by a play of natural forces outside the economic system; 
forces which would persist regardless of the particular form of economic 
organization. These theories range from a view of the income curve as 
a joint probability distribution of biological traits transmitted by 
heredity,¹ to the analysis of income distribution in terms of a game of
chance played under certain specified conditions.²
In contrast to the natural theories, the institutionalist approach 
tends to be nonmathematical in nature. The theories based on this ap- 





* This article is an adaptation of part of a chapter from the 1950 Census monograph, Income of 
the American People, to be published soon by John Wiley and Sons, Inc. This monograph was prepared 
under the joint sponsorship of the Social Science Research Council and the U. S. Bureau of the Census. 

1 Carlos C. Classon, “Some social applications of the doctrine of probability,” Journal of Political 
Economy, 7 (March 1899). 

2 Maurice Fréchet, “Nouveaux essais d’explication de la répartition des revenus,” Revue de L'Insti- 
tut International de Statistique, 13 (1945). 
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proach typically do not seek a model which can produce the skewed 
income curve. Rather, they try by means of descriptive analysis to ex- 
plain income inequality in terms of the institutional setting. These 
theories recognize the importance of certain natural phenomena like 
the distribution of ability, chance, etc. However, their unique stamp 
is the emphasis they give to institutional arrangements like the in- 
heritance laws which they claim help to perpetuate inequality once it 
is established. Thus, for example, Taussig typifies this view when he 
attributes the origin of inequality to differences in innate ability and 
the perpetuation of inequality to “the influence of the inheritance both 
of property and of opportunity,” as well as to the biological transmission
of “native ability.”³ Pigou states the issues even more succinctly.
In his Economics of Welfare, he asks why the curve of income is skewed 
when “there is clear evidence that the physical characters of human 
beings—and considerable evidence that their mental characters—are 
distributed on an altogether different plan,” i.e., a normal curve.⁴
Pigou first considers the possibility that the skewed income curve 
largely reflects the merging of nonhomogeneous groups, each of which 
can be characterized by a non-skewed curve. However, he ultimately 
discovers “a more important and more certain explanation”; namely, 
“Income depends not on capacity alone, whether manual or mental, 
but on a combination of capacity and inherited property. Inherited 
property is not distributed according to capacity, but is concentrated 
upon a small number of persons.”⁵ Pigou thus attributes the major
explanation for the skewness of the income curve to institutional factors
which bestow an undue advantage upon the wealthier classes in society. 

The evidence presented below for the United States suggests that the 
skewness of the income curve reflects the interplay of many forces and 
that it would be unwise and perhaps incorrect to overstress the im- 
portance of any single factor. This is not to say that the statistical 
evidence either supports or contradicts Pigou’s conclusions. Like all 
statistics, these data are subject to interpretation and they can be read 
in such a way as to find them consistent with either the natural or the 
institutional theories. Perhaps the chief value of the empirical evidence 
is that it focuses attention on the characteristics associated with the 
component parts of the income curve and permits the formulation of 
more specific hypotheses regarding the controlling factors in the deter- 
mination of personal income distribution. 





3 F. W. Taussig, Principles of Economics (New York: Macmillan Company, 1920) Second Edition,
p. 246.

4 A. C. Pigou, Economics of Welfare (London: Macmillan and Co.), Third Edition, 1929, p. 648.

5 Ibid., p. 649.
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TABLE 1 


DISTRIBUTION OF THE TOTAL CIVILIAN NONINSTITUTIONAL 
POPULATION OF THE UNITED STATES BY 
TOTAL MONEY INCOME: 1951 




















                                      Number of
Total money income                     persons        Per cent
                                     (thousands)    distribution

Total..............................     151,532        100.0
Loss...............................         287          0.2
None¹..............................      79,786         52.7
$1 to $499.........................      11,553          7.6
$500 to $999.......................       8,968          5.9
$1,000 to $1,499...................       5,955          3.9
$1,500 to $1,999...................       6,314          4.2
$2,000 to $2,499...................       7,246          4.8
$2,500 to $2,999...................       6,385          4.2
$3,000 to $3,499...................       6,959          4.6
$3,500 to $3,999...................       5,309          3.5
$4,000 to $4,499...................       3,946          2.6
$4,500 to $4,999...................       2,296          1.5
$5,000 to $5,999...................       3,013          2.0
$6,000 to $6,999...................       1,363          0.9
$7,000 to $9,999...................       1,220          0.8
$10,000 to $14,999.................         502          0.3
$15,000 and over...................         430          0.3

1 Includes the following groups:
Children under 14 years of age...............................................  42,204
Persons aged 14-19, most of whom were attending high school or college.......   7,702
Housewives...................................................................  25,609
Persons aged 65 and over (excluding housewives)..............................   1,077
All other persons............................................................   3,194


Source: Estimated population based on unpublished data of the Census Bureau. Income dis- 
tribution derived from the Census Bureau report, Current Population Reports—Consumer Income, 
Series P-60, No. 11, table 1. 


COMPONENT PARTS OF THE INCOME CURVE 


In April 1952 there were about 151½ million civilians residing in the
United States. An examination of the distribution of these people by 
their own incomes indicates that the income curve is very much as 
Pigou described it, “humped and lopsided.” The very striking fact 
about Figure 1 is that about one-half of the people in the United States 
received no cash income at all during the calendar year 1951. If persons 
who did unpaid work on the family farm or business were included as 
income recipients, the proportion without income would be reduced only
slightly, since the maximum number of people who were so employed
during any given month in that year was less than 2½ million. The proportion
without income may be somewhat overstated because of the
difficulty of correctly allocating the income within a household and be- 
cause of the tendency for people to forget to report small amounts of 
Fig. 1. Per cent distribution of the civilian noninstitutional population
by total money income, for the United States: 1951. 


interest, dividends, and similar types of income to interviewers. How- 
ever, even if all of the imperfections of measurement could be elimi- 
nated, it is doubtful that the conclusion that only one-half of the popu- 
lation in this country receives money income would have to be changed 
significantly. 
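The arithmetic behind this statement can be checked with a minimal sketch. The counts come from Table 1; the variable and function names are illustrative only, and the sketch assumes, as an upper bound, that all 2½ million unpaid family workers would otherwise have been counted among persons without income.

```python
# Figures in thousands, taken from Table 1.
TOTAL_POPULATION = 151_532
WITHOUT_INCOME = 79_786

# Upper bound stated in the text: fewer than 2.5 million unpaid family workers.
UNPAID_FAMILY_WORKERS_MAX = 2_500


def share(part, whole):
    """Return part as a per cent of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


share_without_income = share(WITHOUT_INCOME, TOTAL_POPULATION)
share_if_unpaid_counted = share(
    WITHOUT_INCOME - UNPAID_FAMILY_WORKERS_MAX, TOTAL_POPULATION
)
print(share_without_income, share_if_unpaid_counted)
```

Under that assumption the proportion without income falls by less than two percentage points and remains above one-half.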

There are some who will object to the presentation of the statistics as 
they are shown in Figure 1. They will correctly point out that the 80
million people without income include 42 million children under 14 
years of age and 8 million additional persons from 14 to 19 years old, 
most of whom are going to school. The figure also includes 26 million 
housewives and 1 million aged persons. Surely these groups should be 
treated separately if the aim is to understand the forces which shape 
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TABLE 2 


PER CENT DISTRIBUTION OF CIVILIAN NONINSTITUTIONAL 
INCOME RECIPIENTS 14 YEARS OF AGE AND OVER BY TOTAL 
MONEY INCOME, BY SEX, FOR THE UNITED STATES: 1951¹








Total money income               Both sexes     Male      Female

Number of persons with income
(thousands)....................    71,746      46,572     25,174

Per cent.......................    100.0       100.0      100.0

Loss
$1 to $499
$500 to $999
$1,000 to $1,499
$1,500 to $1,999
$2,000 to $2,499
$2,500 to $2,999
$3,000 to $3,499
$3,500 to $3,999...............                             2.2
$4,000 to $4,499...............                             0.9
$4,500 to $4,999...............                             0.5
$5,000 to $5,999...............                             0.5
$6,000 to $6,999...............                             0.2
$7,000 to $9,999...............                             0.2
$10,000 to $14,999.............                             0.1
$15,000 and over...............                             0.1

[Per cent entries for the cells left blank are illegible in the source.]











1 To facilitate graphic presentation, all per cents shown in Figure 3 are based on the total of both 
sexes (71,746) rather than on the individual column totals which were used in this table. 

Source: Derived from U. S. Bureau of the Census, Current Population Reports—Consumer Income,
Series P-60, No. 11, Table 4.


income distribution. There is considerable dispute as to who should be 
included in a measure of income distribution. It has been argued that 
a measure of income inequality should refer to the entire population 
rather than to a segment of it.⁶ The major reason for including all of
these groups in the first approximation of the income curve is to provide 
an inventory of the entire population with respect to income. 
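The make-up of the group without income can be tallied in a few lines. The counts are those given in the note to Table 1. The labels for children, housewives and the residual group are inferred from the discussion in the text, because the printed labels are partly illegible, so they should be read as assumptions.

```python
# Persons without income, in thousands (note to Table 1).
# Labels marked "inferred" are assumptions based on the text.
no_income_groups = {
    "children under 14 (inferred)": 42_204,
    "persons aged 14-19, mostly in school": 7_702,
    "housewives (inferred)": 25_609,
    "persons aged 65 and over, excluding housewives": 1_077,
    "all other persons (inferred)": 3_194,
}

total_without_income = sum(no_income_groups.values())
shares = {
    group: round(100 * count / total_without_income, 1)
    for group, count in no_income_groups.items()
}

for group, pct in shares.items():
    print(f"{group}: {pct} per cent")
```

The five groups add up to the 79,786 thousand persons shown as "None" in Table 1, and children under 14 alone account for more than half of them.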





6 “In brief, no group less extensive than the total population should be the base for studying changes 
in the equality of incomes. Morris A. Copeland came to this conclusion in Recent Economic Changes in the 
United States, and Kuznets has stressed its logic in ‘National Income,’ Encyclopedia of Social Sciences, 
and National Income: A Summary of Findings.” The above quotation appears in an article by Dorothy 
S. Brady, “Research in the Size Distribution of Income,” Studies in Income and Wealth (New York: 
National Bureau of Economic Research) Volume 13, p. 10. 
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Even if the analysis of income distribution is restricted to income re- 
cipients, it is apparent from Figure 2 that the income curve remains 
much the way Pigou described it. Over one-fourth of the income re- 
cipients are clustered in the relatively narrow range of incomes between 
$1 and $1,000. If this range is doubled, nearly half of all income recipi- 
ents can be accounted for. 
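These proportions can be reproduced from the counts in Table 1. In the sketch below the counts are those of Table 1 with the "None" group left out; the bracket labels are assumed to follow the scheme printed in Table 4, and the helper name is illustrative.

```python
# Income recipients by bracket, in thousands (Table 1, excluding "None").
recipients = [
    ("Loss", 287),
    ("$1 to $499", 11_553),
    ("$500 to $999", 8_968),
    ("$1,000 to $1,499", 5_955),
    ("$1,500 to $1,999", 6_314),
    ("$2,000 to $2,499", 7_246),
    ("$2,500 to $2,999", 6_385),
    ("$3,000 to $3,499", 6_959),
    ("$3,500 to $3,999", 5_309),
    ("$4,000 to $4,499", 3_946),
    ("$4,500 to $4,999", 2_296),
    ("$5,000 to $5,999", 3_013),
    ("$6,000 to $6,999", 1_363),
    ("$7,000 to $9,999", 1_220),
    ("$10,000 to $14,999", 502),
    ("$15,000 and over", 430),
]
total_recipients = sum(count for _, count in recipients)


def share_of_brackets(first, last):
    """Per cent of recipients in list positions first..last, inclusive."""
    count = sum(n for _, n in recipients[first:last + 1])
    return round(100 * count / total_recipients, 1)


under_1000 = share_of_brackets(1, 2)  # $1 to $999
under_2000 = share_of_brackets(1, 4)  # $1 to $1,999
print(total_recipients, under_1000, under_2000)
```

The total agrees with the 71,746 thousand recipients shown in Table 2, and the two shares correspond to "over one-fourth" and "nearly half."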

It can be noted in Figure 2 as well as in the other figures that the 
curve for both extremes of the distribution has been fitted with a dotted 
















Fig. 2. Per cent distribution of income recipients by total money income, 
for the United States: 1951. 


line rather than with the solid one which has been used for the rest of 
the distribution. The dotted line has been used to indicate that the 
data for this part of the curve represent only rough approximations. 
For example, in Figure 2, the only information available for the “Loss” 
interval and the “$15,000 and over” interval is that they respectively in- 
clude 0.4 per cent and 0.6 per cent of the income recipients. Since these 
intervals comprise only a small area of most of the distributions for 
which data are presented, this liberty with the figures has the advantage 
of indicating the shape of the complete curve without significantly af- 
fecting any of the conclusions. 
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One of the most important sources of heterogeneity in the income 
curve is the inclusion of both men and women in the same distribution. 
It is desirable for several reasons to show income distributions sepa- 


TABLE 3 


PER CENT DISTRIBUTION OF FEMALE INCOME RECIPIENTS 
14 YEARS OF AGE AND OVER BY TOTAL MONEY INCOME AND 
EXTENT OF EMPLOYMENT, FOR THE UNITED STATES: 1949¹








Worked 50 weeks or more

Worked less than 50 weeks or did not work at all

Worked in 1949

Total money income

Number of women with
income (thousands).... 9,348
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1 To facilitate graphic presentation, all per cents shown in Figure 4 are based on all female income 
recipients (22,272) rather than on the individual column totals which were used in this table. 


Source: U. S. Bureau of the Census, U. S. Census of Population: 1950, Vol. II, Characteristics of the
Population, Part I, United States, Table 141. 





rately for each sex. Although income from employment represents the 
most important source of receipts for both men and women, the labor 
force behavior of women is markedly different from that of men. In our 
society it is customary for the man to provide for his own support and 
for that of his family. Accordingly, practically all able-bodied men in 
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the productive age groups become full-time workers and develop per- 
manent attachments to the labor force. In contrast, it is customary for 
women to have the primary responsibility for home management. Dur- 
ing any given month, roughly three-fourths of the married women do 
not engage in any paid economic activity and only one-fourth are in 
the labor force as either paid workers or unpaid workers in their family 
Fig. 3. Per cent distribution of income recipients by total money income,
by sex, for the United States: 1951. 


farm or business.⁷ Nevertheless, married women comprise about one-half
of the female labor force.⁸ This means among other things that the
majority of women workers can typically accept only intermittent em- 
ployment and at jobs which interfere least with the fulfillment of their 
prime responsibility. Moreover, even when the working woman is not 
a housewife, she frequently limits herself to certain types of jobs. For 
these and many other reasons, it is important in the analysis of income 
distribution that the data for men and women be studied separately. 

Figure 3 presents the separate income distributions for men and 





7 U. S. Bureau of the Census, Current Population Reports—Labor Force, Series P-50, No. 39,
Table 1. 
8 Ibid., Table 1. 
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women. It will be noted that both curves are skewed and show a pro- 
nounced tendency toward bimodality. Neither curve resembles the 
“symmetrical curve shaped like a cocked hat” which Pigou sought. In 
terms of the problem which Pigou set out to analyze, it should be noted 
that the skewness of the income curve for women can largely be ex- 
plained in terms of the factors noted above rather than by reference to 
inheritance. In interpreting Figure 3 and the subsequent figures which 
will be discussed, it should be noted that the combined area under both 
curves represents unity. The area under the curve for males represents 
nearly two-thirds of the total area under both curves because nearly 
two-thirds of all income recipients were men. These curves as shown, 
therefore, accurately reflect the weights attached to each of the com- 
ponents of the distribution. 
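The weighting can be illustrated with a short sketch. The two totals are those of Table 2; the function name and the 10 per cent example are hypothetical and serve only to show how a per cent based on one sex is restated on the base of all income recipients, as is done for Figure 3.

```python
# Income recipients in thousands (Table 2).
MALE = 46_572
FEMALE = 25_174
ALL_RECIPIENTS = MALE + FEMALE

male_weight = MALE / ALL_RECIPIENTS  # share of the combined area under the male curve


def rebase(per_cent_within_group, group_total, grand_total=ALL_RECIPIENTS):
    """Restate a within-group per cent as a per cent of all recipients."""
    return per_cent_within_group * group_total / grand_total


# Hypothetical bracket holding 10 per cent of the men.
example = round(rebase(10.0, MALE), 1)
print(round(100 * male_weight, 1), example)
```

A bracket containing a given per cent of men therefore occupies a little under two-thirds of that per cent in the combined figure.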

One important fact to note about female income recipients is that 
when the distributions for full-year workers are separated from the 
distributions for women without work experience during the year or for 
women who worked only part of the year, two unimodal curves appear. 
The arithmetic mean and the median of the curve for the full-year 
workers differ by only $100,⁹ and an examination of the relationship between
the quartiles and median, which is a rough measure of the symmetry
of a distribution, indicates only moderate skewness.¹⁰ The curve
for the nonworkers and the part-year workers is considerably more
skewed than the curve for the full-year workers. The quartiles in this
distribution are not symmetrical about the median;¹¹ and the mean is
about $400 higher than the median, indicating a concentration of people
in the lower part of the distribution.¹² This distribution resembles the
type of income curve which Pigou had in mind. Much of the asym- 
metry of the curve for nonworkers and part-year workers can be ex- 
plained in terms of the combination of dissimilar groups such as non- 
workers, many of whom were living on transfer payments or on in- 
herited property, and part-year workers whose periods of employment 
could have ranged anywhere from 1 week to 49 weeks. It is equally im- 
portant to note, however, that the small absolute difference between 
the mean and the median indicates that the tail of the distribution 
carries relatively little weight. Therefore, from an analytical viewpoint, 
the essential feature of this distribution may well be the symmetry 
which characterizes it throughout most of its range. 
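The quartile comparison used above as a rough measure of symmetry can be written out as a small function. The means and medians are those given in the footnotes. The quartile values passed to the function are hypothetical, chosen only so that the result is close to the ratios reported for full-year workers, since the quartiles themselves are not printed.

```python
def quartile_asymmetry(q1, median, q3):
    """Return (1 - Q1/Q2, Q3/Q2 - 1); equal values indicate symmetry about the median."""
    lower = 1 - q1 / median
    upper = q3 / median - 1
    return lower, upper


# Full-year workers: median from the footnote, quartiles hypothetical.
lower, upper = quartile_asymmetry(q1=1_327, median=1_980, q3=2_574)

# Mean minus median for the two groups of women (footnote values).
gap_full_year = 2_109 - 1_980
gap_other = 1_025 - 628

print(round(lower, 2), round(upper, 2), gap_full_year, gap_other)
```

The two ratios are nearly equal for full-year workers, while the gap between mean and median is about three times as large for the nonworkers and part-year workers.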





9 Arithmetic mean, $2,109; Median, $1,980.

10 1 − (Q₁/Q₂) = .33; (Q₃/Q₂) − 1 = .30.

11 1 − (Q₁/Q₂) = .55; (Q₃/Q₂) − 1 = 1.05.

12 Arithmetic mean, $1,025; Median, $628.
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The income curve for men in Figure 3 retains some of the skewness 
and much of the bimodality which appeared in the distribution for all 
income recipients. Apparently, therefore, this curve may still contain 
elements of heterogeneity which are to be accounted for. The simple 
expedient of classifying income recipients as either full-year workers or 
as nonworkers or part-year workers, which was used for women, cannot 


[Curve labels: without work experience or with 1-49 weeks of work in 1949; worked 50 weeks or more in 1949]





Fig. 4. Per cent distribution of female income recipients by total money
income and extent of employment, for the United States: 1949. 


be meaningfully employed for men. Although the nonworkers do repre- 
sent a significantly different group from full-year workers, the same 
meaning cannot necessarily be ascribed to the part-year workers. A
large proportion of the men who worked less than a full year may have 
either worked in industries like construction where pay scales are ad- 
justed to account for the seasonality of the work, or else they experi- 
enced some temporary unemployment. As a first approximation, it is 
convenient to regard the income curve for men as a combination of the 
curves for two groups: (a) those who are employed; and (b) those who 
are not employed, which includes both those who are unemployed and 
those who are not in the labor force. Although separate data are avail- 
able for the unemployed, they are combined with persons who were not 
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in the labor force because of their relatively small numbers. The in- 
clusion of the unemployed with the employed would not have sig- 
nificantly changed the results. 
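The last statement can be checked roughly from the per cents in Table 4. The sketch below pools the unemployed with the employed, weighting each column by its number of men, and reports the largest change in any bracket. Two entries are partly illegible in the printed table and are read here as 21.7 (unemployed, $1 to $499) and 7.3 (employed, $5,000 to $5,999); both readings are assumptions, as are the names used.

```python
# Thousands of men with income (Table 4).
EMPLOYED_N = 40_687
UNEMPLOYED_N = 950

# Per cent distributions by bracket, "Loss" through "$15,000 and over".
employed = [0.5, 5.1, 5.3, 6.0, 6.6, 10.4, 10.8, 13.9,
            11.7, 9.1, 5.4, 7.3, 3.2, 2.8, 1.2, 1.0]   # 7.3 is an assumed reading
unemployed = [0.4, 21.7, 16.3, 14.1, 14.9, 9.4, 8.7, 5.4,
              1.1, 4.0, 3.2, 0.7, 0.7, 0.4, 0.0, 0.0]  # 21.7 is an assumed reading

pooled_n = EMPLOYED_N + UNEMPLOYED_N
pooled = [
    (EMPLOYED_N * e + UNEMPLOYED_N * u) / pooled_n
    for e, u in zip(employed, unemployed)
]

# Largest change, in percentage points, caused by adding the unemployed.
max_shift = max(abs(p - e) for p, e in zip(pooled, employed))
print(round(max_shift, 2))
```

On these figures no bracket moves by as much as half a percentage point, which is consistent with the statement in the text.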


TABLE 4 


PER CENT DISTRIBUTION OF CIVILIAN MALE INCOME RE- 
CIPIENTS 14 YEARS OF AGE AND OVER BY TOTAL MONEY IN- 
COME IN 1951, BY LABOR FORCE STATUS IN 
APRIL 1952, FOR THE UNITED STATES¹


























                                          Unemployed or not in the
                                                labor force
Total money income          Employed
                                        Total   Unemployed   Not in the
                                                             labor force

Number of men with
income (thousands)......     40,687     5,885      950         4,935
Per cent................      100.0     100.0    100.0         100.0
Loss....................        0.5       0.2      0.4           0.2
$1 to $499..............        5.1      33.7     21.7          36.0
$500 to $999............        5.3      29.3     16.3          31.8
$1,000 to $1,499........        6.0      12.6     14.1          12.3
$1,500 to $1,999........        6.6       8.0     14.9           6.7
$2,000 to $2,499........       10.4       4.7      9.4           3.9
$2,500 to $2,999........       10.8       3.4      8.7           2.3
$3,000 to $3,499........       13.9       2.5      5.4           1.9
$3,500 to $3,999........       11.7       1.3      1.1           1.4
$4,000 to $4,499........        9.1       1.1      4.0           0.6
$4,500 to $4,999........        5.4       0.6      3.2           0.3
$5,000 to $5,999........        7.3       0.9      0.7           0.9
$6,000 to $6,999........        3.2       0.5      0.7           0.4
$7,000 to $9,999........        2.8       0.8      0.4           0.8
$10,000 to $14,999......        1.2       0.3       -            0.4
$15,000 and over........        1.0       0.1       -            0.1





1 To facilitate graphic presentation, all per cents shown in Figure 5 are based on all male income 
recipients (46,572) rather than on the individual column totals which were used in this table. 


Source: U. S. Bureau of the Census, Current Population Reports—Consumer Income, Series P-60, 
No. 11, Table 4.


It is apparent from Figure 5 that this dichotomy eliminates much of 
the bimodality from the income distribution for men. The income curve 
for men who were not employed in April 1952 is quite symmetrical. 
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Although the curve for employed men is more symmetrical than the 
curve for all men shown in Figure 3, it apparently still contains some 
elements of heterogeneity. Note, for example, the pronounced bulge in 
the lower part of the distribution as well as the extended tail of the dis- 
tribution in the direction of the higher values. In order to obtain a 
better understanding of the income distribution for employed men, it is 
necessary first to examine the distributions for the component occupa- 
tion groups. In Figure 6 the income distribution for employed men has 





[Vertical axis: per cent of all male income recipients]





Fig. 5. Per cent distribution of male income recipients by total money income
in 1951, by employment status in April 1952, for the United States. 


been divided into three groups: (a) farmers and farm managers (9 per 
cent of the total); (b) service workers and laborers (17 per cent of the 
total); and (c) men employed in other occupations (74 per cent of the 
total). It is apparent that the distributions for the first two groups are 
asymmetrical. The distribution for farmers and farm managers is 
skewed in the direction of the higher incomes; and the distribution for 
service workers and laborers has several modal groups. In contrast, 
there appears to be considerable symmetry in the income distribution 
for other employed men despite the fact that it is an amalgamation of 
many different occupation groups. Before proceeding with an analysis 
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of the occupation groups included in the “other employed males” cate- 
gory, it is important to point out that further analysis may provide an 
explanation for the asymmetry of the distributions for farmers and for 


TABLE 5 


DISTRIBUTION OF EMPLOYED MALES 14 YEARS OF AGE AND 
OVER BY TOTAL MONEY INCOME IN 1951, BY OCCUPATION 
GROUP IN APRIL 1952, FOR THE UNITED STATES¹














Columns, in order:
(1) Service workers and laborers
(2) Farmers and farm managers
Other employed men:
(3) Total
(4) “White collar” workers²
(5) “Blue collar” workers³
(6) Independent professionals, proprietors, managers and officials

Total money income (1) (2) (3) (4) (5) (6)

Number of men with
income (thousands).. 6,915 3,795 29,977 7,540 16,962 5,475
Per cent............ 100.0 100.0 100.0 100.0 100.0 100.0
Loss................ 0.2 2.6 0.3 0.1 0.1 1.3
$1 to $499.......... 10.6 16.5 2.4 3.7 1.8 2.6
$500 to $999.......... 10.0 15.7 2.9 3.3 2.9 2.2 
$1,000 to $1,499...... 11.6 14.9 3.6 3.0 4.0 3.2 
$1,500 to $1,999...... 12.5 $81 4.7 3.6 5.7 3.0 
$2,000 to $2,499...... 15.1 10.4 9.3 6.6 11.0 7.2 
$2,500 to $2,999...... 32.7 6.7 11.1 9.6 12.6 8.1 
$3,000 to $3,499...... 13.3 5.3 15.2 14.4 16.8 10.1 
$3,500 to $3,999...... 7.3 2.9 13.8 14.7 14.7 9.4
$4,000 to $4,499...... 3.6 2.9 11.1 12.5 11.3 8.7 
$4,500 to $4,999...... 2.3 1.3 6.6 7.0 6.5 6.3 
$5,000 to $5,999...... eS 3.0 9.0 9.3 8.3 11.3 
$6,000 to $6,999...... 0.2 1.6 4.1 5.5 3.0 5.8 
$7,000 to $9,999...... 0.4 3.3 3.4 5.0 1.1 9.1
$10,000 to $14,999.... - 1.2 1.4 1.1 0.2 6.2
$15,000 and over...... 0.2 1.9 - a 0.6 _ 5.5 























1 To facilitate graphic presentation, all per cents shown in Figure 6 are based on all employed males 
with income (40,687) and those in Figure 7 are based on the total “other” employed males (29,977) 
rather than on the individual column totals which were used in this table. 

2 Salaried professional and technical workers, clerical workers, and sales workers. 

3 Craftsmen, foremen, operatives and kindred workers.

Source: U. S. Bureau of the Census, Current Population Reports—Consumer Income, Series P-60,
No. 11, Table 5. 


service workers and laborers. In the case of the farmers, for example, 
it is entirely possible that the single skewed distribution could be re- 
duced to two symmetrical distributions if separate data were available 
for sharecroppers and for other farmers. In the case of service workers 
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and laborers preliminary analysis indicates that even the more detailed 
occupations within these groups such as farm laborers, nonfarm labor- 
ers, and service workers have asymmetrical income distributions. It 
may well be that variations in extent of employment are a key factor 
in the explanation of the asymmetry of these distributions. 
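The occupational shares quoted in the text follow directly from the counts in Table 5. The short sketch below recomputes them; the dictionary keys are shortened labels chosen for illustration.

```python
# Employed men with income, in thousands (Table 5).
employed_men = {
    "service workers and laborers": 6_915,
    "farmers and farm managers": 3_795,
    "other employed men": 29_977,
}
other_employed = {
    "white collar": 7_540,
    "blue collar": 16_962,
    "independent professionals, proprietors, managers": 5_475,
}


def per_cent_shares(groups):
    """Return each group's share of the total, rounded to a whole per cent."""
    total = sum(groups.values())
    return {name: round(100 * count / total) for name, count in groups.items()}


shares_of_employed = per_cent_shares(employed_men)
shares_of_other = per_cent_shares(other_employed)
print(shares_of_employed)
print(shares_of_other)
```

The first set of shares reproduces the 17, 9 and 74 per cent used for Figure 6; the second gives the 25, 57 and 18 per cent split of the "other employed men" category.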

Returning once again to the over-all income distribution for em- 
ployed men in Figure 5, it is apparent that the bulge in the lower part 
of the distribution is largely attributable to the inclusion of farmers, 
Fig. 6. Per cent distribution of employed males by total money income in 1951,
by occupation group in April 1952, for the United States. 


service workers, and laborers. Once these occupation groups are re- 
moved for separate analysis, the distribution for the remaining three- 
fourths of the employed men is quite symmetrical except for the fact 
that it retains a rather pronounced “tail” in the direction of the higher 
values. In order to achieve a better understanding of this distribution, 
it has been divided into three occupation groups: (a) “blue-collar” 
workers or operatives and craftsmen (57 per cent of the total); (b) 
“white-collar” workers or salesmen, clerks, and salaried professionals 
(25 per cent of the total); and (c) independent professionals, nonfarm 
proprietors, and managerial workers (18 per cent of the total). Here it 
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may be noted that the three distributions are essentially symmetrical 
and appear to differ from each other primarily with respect to level of 
income and income dispersion. The “tail” of the over-all distribution 
is largely attributable to the independent professional, business, and 
managerial group which contains about three-fourths of all men with 
incomes over $10,000. 
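The point that symmetrical distributions differing in level and dispersion can combine into a skewed one may be illustrated numerically. The sketch below is not the method of the paper: it treats each occupation group as a normal distribution, takes the group weights (57, 25 and 18 per cent) from the text, and uses purely hypothetical means and standard deviations. The skewness of the mixture is computed from the standard moment formulas for a mixture of normal distributions.

```python
def mixture_skewness(components):
    """Skewness of a mixture of normal distributions.

    components: list of (weight, mean, std_dev); weights are assumed to sum to 1.
    """
    mean = sum(w * mu for w, mu, _ in components)
    variance = sum(w * (sd ** 2 + (mu - mean) ** 2) for w, mu, sd in components)
    third_moment = sum(
        w * ((mu - mean) ** 3 + 3 * (mu - mean) * sd ** 2)
        for w, mu, sd in components
    )
    return third_moment / variance ** 1.5


# Weights from the text; income levels and dispersions are hypothetical.
groups = [
    (0.57, 3_200, 1_000),  # "blue-collar" workers
    (0.25, 3_700, 1_400),  # "white-collar" workers
    (0.18, 5_500, 3_500),  # independent professionals, proprietors, managers
]

print(round(mixture_skewness(groups), 2))
print(mixture_skewness([(1.0, 3_200, 1_000)]))  # a single symmetrical group
```

Each component has zero skewness by construction, yet the merged distribution has a pronounced tail toward the higher values, contributed mainly by the small group with the highest level and widest dispersion.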
Fig. 7. Per cent distribution of “other” employed males by total money income
in 1951, by occupation group in April 1952, for the United States. 


SOME TENTATIVE CONCLUSIONS


In this paper an attempt has been made to explain the skewness of 
the over-all distribution of income in terms of the major components 
of that distribution rather than by reference to special institutional
conditions such as the advantages bestowed by inheritance. The basic 
thesis is that much of the skewness of the income curve is primarily 
attributable to the merging of several different types of distributions 
which are themselves largely symmetrical. It has been shown that 
much of the skewness of the income curve is attributable to the inclu- 
sion of women in the distribution. Although the essential difference 
between the income curves for men and women may stem from the 
mores of our society, it has little to do with inheritance. Even in the 
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case of men, there is considerable symmetry in income distribution. The 
income distributions for the three groups of occupations which together
include about three-fourths of the employed men were found to be 
quite symmetrical when analyzed separately. 

Once the factors which largely account for the skewness of the in- 
come distribution in the United States are separated, it immediately 
becomes apparent that the skewness of income distribution for other 
times and places may have resulted from an entirely different set of factors.
In less industrialized societies the landed aristocracy would undoubtedly
replace the professional and managerial groups as the upper
income groups. Therefore, two societies may have the same degree of 
inequality of incomes for different reasons. 

The data which have been presented do not in themselves either 
prove or disprove the conclusion reached by Pigou. Even if the skew- 
ness of the income curve can be explained in terms of the merging of 
nonhomogeneous groups, the fact nevertheless remains that there are 
great variations in the incomes received for different kinds of work. 
Inheritance may be an important factor in explaining income inequality 
in a number of obvious ways, including the restriction of entry into the 
better-paying occupations. In 1951 only 2 per cent of the men made 
over $10,000; but they received 12 per cent of the income.¹³ It is significant
that only a very small proportion of these men can be classified
as the “idle rich.” For example, 99 out of every 100 of these men were
in the labor force in April 1952,¹⁴ and 70 out of every 100 derived their
incomes entirely from earnings.¹⁵ Only 2 per cent of these men derived
their incomes entirely from property or investments.¹⁶ Further analysis
of the characteristics of the highest income recipients indicates that
nearly three-fourths of them were either independent professionals,
businessmen, or managerial workers.¹⁷ If farmers are included, about
85 per cent of the highest income group can be accounted for. It is ap- 
parent, therefore, that these occupations are the channels through 





13 Estimated from Census Bureau report, Series P-60, No. 11, Table 1.

14 Ibid., Table 4.

15 Ibid., Table 7. Earnings as defined here include income received from wages or salaries or from
the operation of a farm, business, or professional practice. It is undoubtedly true that in a broad philo- 
sophical sense this definition of earnings may include the receipts from property or investments in which 
work was performed only in a nominal way. However, it is impossible to determine the magnitude or 
the importance of such receipts from the data which are currently available. 

16 These figures undoubtedly understate the prevalence of investment income within this group. 
They also probably exaggerate the extent of employment among the wealthy because of the inclusion 
of some people who hold purely nominal jobs. Nevertheless, there is no evidence to controvert the con- 
clusion that earnings are the most important source of income for the upper income groups. 

17 Census Bureau report, Series P-60, No. 11, Table 5. 
which one can “hit the jackpot” in our society.¹⁸ If it can be shown that
entry into the independent professions, business ownership, or mana- 
gerial work is limited to persons reared in wealthy families, much of 
Pigou’s analysis could be correct. However, if there is relatively free 
entry into these occupations, then it might well be that “a large part 
of the existing inequality of wealth can be regarded as produced by men 
to satisfy their tastes and preferences.”¹⁹ In other words, to the extent
that all men have access to the professional and managerial occupations, 
and the numbers admitted to these occupations are sufficient to keep 
monopolistic practices at a minimum, the income differences between 
these occupations and others may merely reflect the payments by so- 
ciety for rare skills or risk-taking. 





18 Some will argue that an annual income of $10,000 is a rather small “jackpot.” The data in a recent 
Census Bureau report (Series P-60, No. 14) indicate that essentially the same conclusions apply to the 
“$25,000 and over” income group. 

19 M. Friedman, “Choice, chance, and the personal distribution of income,” Journal of Political
Economy, LXI (August 1953), 





A STRUCTURE OF MONEYFLOWS* 


MORRIS MENDELSON
Pennsylvania State University 


THERE has been general agreement among the reviewers that A Study
of Moneyflows in the United States¹ is a significant contribution
to economics and deserves careful consideration. This paper is
intended to explain the broad outlines of Professor Copeland’s
moneyflows accounts and to discuss the general problems involved. It also
aims to make the basic concepts more accessible to the general reader.

Copeland’s accounts differ from both the gross national product accounts
of the National Income Division (NID) of the Department of
Commerce and the moneyflows accounts that have been subsequently 
developed from Copeland’s study by the research staff of the Board of 
Governors of the Federal Reserve System. In general the events em- 
braced by each particular set of accounts are delineated by the prob- 
lems the accounts are to deal with. There can be no direct comparison 
of the merits of the different social accounting systems and no such 
comparison will be attempted here. 

In most of the events with which we concern ourselves in economic 
analysis there is a surrender of an economic good (or service) for a 
financial claim of some sort.² The surrender of economic goods may be
looked upon as a flow of values and the corresponding financial claims 
as the synchronous compensatory flow. While, with few exceptions, 
the first type of flow is always matched by a synchronous compensa- 
tory flow, it is not necessarily true that the matching flow is one of 
financial claims. There are also cases of two way flows of financial 
claims.

Every flow has an origin and a terminus, i.e., an economic organism 
from which the flow originates and another at which it terminates. The 





* I am indebted to Mr. D. H. Brill and Mr. S. J. Sigel of the Research Staff of the Board of 
Governors of the Federal Reserve System both for their extensive and constructive criticism and for 
the many ideas I have absorbed from their unpublished manuscripts which they so kindly made avail- 
able to me. It proved impossible to give adequate acknowledgment to these papers and the extent to 
which they are cited below in no way indicates how heavily I have relied upon them. I am also in- 
debted to Professor M. A. Copeland of Cornell University and Mr. J. C. Dawson of the University of 
Maryland for their extensive comments, criticisms, and suggestions. I alone, however, am responsible 
for whatever errors remain. 

1 M. A. Copeland, A Study of Moneyflows in the United States, National Bureau of Economic Research,
New York, 1952. Henceforth this book will be referred to as Moneyflows.

2 For an elaboration of this idea see Chandler Morse, Basic Concepts of Accounting and Their Economic
Application, Chapter 1, Norton Printing Co., Ithaca, N. Y., 1952. As used here, financial claims
include currency and deposits, book credit, securities, mortgages, and monetary metal.
origin of one flow is the terminus of the compensatory flow. The size of
the flow at the terminus is equal to the size at the origin. If records were 
kept of all these flows, we would end up with a fourfold register of every 
event. Every economic organism would register two flows simultaneously.
Social accounting becomes possible when events can be characterized
by such synchronous compensatory flows of values. This is the
double entry aspect of social accounting. The possibility of such fourfold
records of events is what permits the systematic organization of
data describing the events.
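
The fourfold register described above can be illustrated with a small sketch. The names `Entry` and `fourfold_register`, and the sample sale by a grocer to a household, are hypothetical and chosen only for illustration; they are not drawn from Moneyflows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    organism: str   # economic organism keeping the record
    flow: str       # "goods" or "financial claim"
    role: str       # "origin" or "terminus" of that flow
    value: float    # size of the flow, equal at both ends


def fourfold_register(seller: str, buyer: str, value: float) -> list[Entry]:
    """Return the four records of one event: goods surrendered for a financial claim."""
    return [
        Entry(seller, "goods", "origin", value),            # flow of values leaves the seller
        Entry(buyer, "goods", "terminus", value),           # and arrives at the buyer
        Entry(buyer, "financial claim", "origin", value),   # compensatory flow leaves the buyer
        Entry(seller, "financial claim", "terminus", value),  # and arrives at the seller
    ]


entries = fourfold_register("grocer", "household", 100.0)
for entry in entries:
    print(entry)
```

Each organism appears in exactly two of the four records, once as an origin and once as a terminus, which is the double entry aspect referred to in the text.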

The different sets of social accounts are summaries of the two way 
flows of different groupings of the events and as such present different 
perspectives of the economy. It makes little sense to ask of any set of 
social accounts why it is superior to any other set unless they are both 
designed to deal with the same problems. It is very easy to forget that 
social accounting is a means rather than an end in economics. There is 
no set of accounts that can be held inviolable. The research staff of the 
Board of Governors of the Federal Reserve System, for instance, has 
developed a somewhat different structure of the Moneyflows accounts 
than that originally conceived by Copeland because of the different 
focus of their analytic interests. 


Broad Aspects of Moneyflows 


Moneyflows are the flows and synchronous compensatory flows that 
result from a particular set of events. The events dealt with in this 
paper are those that Copeland believes relevant to the problems he 
poses. More specifically, among these problems are: 

“Who purchases the gross national product?

“Where does the money to buy the gross national product come from?

“What part do cash balances play in the process of business expansion and contraction?

“What is the role of the banking and monetary system?”³
Even this small sample of the questions that a structure of moneyflows 
can help us answer should make it abundantly clear that such a set of 
accounts is highly useful. This alone more than justifies an examination 
of the accounts. It is with the accounts that this paper is concerned, 
not with the answers. 

The events that give rise to moneyflows can be described as only 
those in which one flow results in an increase in some financial claim of 
one party and a corresponding decrease in those of a distinct second 


3 Copeland, op. cit., p. 5. 
party.⁴ Monetary gifts will be included in moneyflows though, in fact,
it is not always possible to measure this latter flow. There must be two 
distinct parties. Each party is called a transactor and each event is 
called a transaction. 

An increase in a transactor’s financial claims is called a disposition 
of money and his corresponding counterflow a source of money. These 
correspond to debits and credits respectively. Similarly a decrease in 
financial claims owned is called a source of money and the correspond- 
ing flow a disposition. 

“Moneyflows are sources and dispositions of money.”⁵ Each transaction
is taken in its totality so that in the case of each transactor and
transaction, the accounts, recording sources and dispositions of money, 
balance except for statistical discrepancies and deviations from ac- 
counting uniformity. In the case of each transactor there are two major 
sources and dispositions of money: financial and non-financial. For 
example, a household’s source may be the sale of a second hand good 
(non-financial) or the sale of a government bond (financial). Obviously 
only the former results in net changes in financial claims. 
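
The household example can be worked through in a short sketch. The function names and the $50 amounts are hypothetical, and the sketch considers only financial claims owned (liabilities are ignored for simplicity); it is meant as an illustration of the definitions above rather than as Copeland's procedure.

```python
def net_change_in_financial_claims(flows):
    """Financial dispositions (increases in claims) less financial sources (decreases)."""
    change = 0
    for kind, side, amount in flows:
        if kind == "financial":
            change += amount if side == "disposition" else -amount
    return change


def is_balanced(flows):
    """A transactor's sources and dispositions for one transaction should be equal."""
    sources = sum(amount for _, side, amount in flows if side == "source")
    dispositions = sum(amount for _, side, amount in flows if side == "disposition")
    return sources == dispositions


# Household sells a second hand good for cash: non-financial source, cash disposition.
used_good_sale = [("non-financial", "source", 50), ("financial", "disposition", 50)]
# Household sells a government bond for cash: one claim falls, another rises.
bond_sale = [("financial", "source", 50), ("financial", "disposition", 50)]

print(net_change_in_financial_claims(used_good_sale))  # 50
print(net_change_in_financial_claims(bond_sale))       # 0
```

Both transactions balance for the household, but only the sale of the second hand good changes its net holdings of financial claims.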

Just as each transaction results in an equal source and disposition of 
money to a transactor,⁶ each flow in the transaction is a source of
money to one transactor and an equal disposition to the other. We can 
thus draw up several types of balancing accounts which, except for 
statistical discrepancies and deviations from accounting uniformity, 
should balance.⁷
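
A minimal sketch of the two kinds of balance may help: by transactor (each sector's sources equal its dispositions) and by transaction type (each type's sources across sectors equal its dispositions). All figures and labels below are hypothetical and are not taken from Table I; statistical discrepancies are assumed to be zero.

```python
from collections import defaultdict

# Hypothetical records: each flow appears once as a source and once as a disposition.
RECORDS = [
    {"type": "payrolls", "sector": "households", "side": "source", "amount": 70},
    {"type": "payrolls", "sector": "business", "side": "disposition", "amount": 70},
    {"type": "customer moneyflows", "sector": "business", "side": "source", "amount": 60},
    {"type": "customer moneyflows", "sector": "households", "side": "disposition", "amount": 60},
    # Households add 10 to their cash; business cash falls by 10.
    {"type": "currency and deposits", "sector": "households", "side": "disposition", "amount": 10},
    {"type": "currency and deposits", "sector": "business", "side": "source", "amount": 10},
]


def imbalances(records, key):
    """Sources minus dispositions, grouped by the given key ("sector" or "type")."""
    totals = defaultdict(int)
    for record in records:
        sign = 1 if record["side"] == "source" else -1
        totals[record[key]] += sign * record["amount"]
    return dict(totals)


by_sector = imbalances(RECORDS, "sector")
by_type = imbalances(RECORDS, "type")
print(by_sector)
print(by_type)
```

With actual data a nonzero entry in either result would correspond to a statistical discrepancy or a deviation from accounting uniformity.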

The main circuit moneyflows are a subgroup of moneyflows transactions.
Copeland suggests that the transactions to be included in the
main circuit moneyflows are those that play a substantive part in
affecting over-all economic adjustments.⁸ This criterion must be a loose
one, and Copeland does not consider it the only one.

The working definition of the main moneyflows circuit is operational. 
It is the statement of detailed specifications of the measurement of the 
variables. Among those transactions omitted are what Copeland calls 





4 It should be noted that transactions in which the compensatory flows have the opposite effect on 
financial claims (so that the net effect is zero) are moneyflows. I have gathered from discussion that this 
description of moneyflows, while technically correct, violates the spirit of moneyflows. It seems that
barter transactions undertaken in the spirit that the financial liabilities resulting from the transactions 
are cancelled against each other, should be included in moneyflows measures though the magnitude of 
such transactions may sometimes be difficult to measure. 

5 Copeland, op. cit., p. 232.

6 Ibid., p. 234. See implications 2 and 3.

7 Ibid., p. 232. See First Feature.

8 Ibid., p. 10. 
(a) money changer transactions, (b) agency transactions, and (c) financial
turnover transactions. As we can see from transaction number 14,
Table I, financial transactions are not completely excluded from the
system of accounts. The net effects of financial transactions are recorded
in the loanfund balance accounts, i.e., the statements of financial assets
and liabilities.

Although moneyflows do not have to be presented in the form of 
social accounts, they can be, and Copeland has presented them that 
way. To construct such a system using actual data, as Copeland did, 
necessitates explicit definitions of the sectors and the transaction types. 
There is, however, nothing final about either the definitions or the ar- 
rangement of the data. These can and should be changed to reflect the 
institutional complex and other factors appropriate to the problems at 
hand. 

In this paper, while the differences between the two sets of money- 
flows accounts will sometimes be noted, we shall pay more attention to 
the differences and similarities between Copeland’s accounts and the 
gross national product accounts. 

Before embarking upon any extensive detail of the moneyflows
structure, however, we may cursorily examine how even a very condensed
version gives us an insight into business fluctuations. For this purpose
we may leave Professor Copeland’s sectors untouched. For the
particular illustrative analysis which we wish to undertake here, however,
we do not need the detailed transactions, and we shall condense them
considerably. Of the fourteen types of transactions (the first fourteen row
captions of Table I) into which Professor Copeland has classified
moneyflows, the first thirteen are collectively labelled ordinary transactions.
For our purposes the transactions may be rearranged and condensed
as follows:








Ordinary Expenditures                       Ordinary Receipts
  1. GNP or Final Product Expenditures        5. Product Receipts
  2. Non-Final Product Expenditures           6. Transfer Receipts
  3. Transfer Expenditures
4. Money Advanced                           7. Money Obtained Thru Financing

TOTAL DISPOSITION OF MONEY                  TOTAL SOURCES OF MONEY











TABLE I 
STRUCTURE OF THE MONEYFLOWS ACCOUNTS 
Sources & Dispositions of Money—1939—Millions of Dollars





























Sectors 
erent I. Households IL. Farms III. Industrial | IV. Business 
corp. prop. et al, 
Non-Financial 8 D 8 D 8 D 8 D 
rr ree 45,100 
860 780 25,200 7,800 
D.C IR. occ csc kcscccacecsc 2,700 340 60 
1,320 540 1,800 320 
By Ge IIR sks osoiccccccccas 3,800 960 ene 60 
4. Net Owner Takeouts............... 9,300 
2,420 6,160 
5. Installments to Contractors......... 2,200 3,300 
1,040 60 680 700 
Ro cho cbs cicntecnsscceas 520 220 
3,900 440 1,860 1,500 
7. Customers Moneyflows............. 140 7,900 119,500 50,400 
51,200 3,500 81,500 36,600 
8. Net Payments for Real Estate....... 600 100 ja jm 
rr ree 5,300 
3,200 480 6,500 1,900 . 
St, I cv ccidwcnnenwneceacey 20 40 10 
11. Insurance Premiums............... 160 
4,300 100 1,060 440 
12. Insurance Benefits................. 3,720 60 260 140 
13. Public Purpose Payments........... 1,560 810 40 1,000 840| fy 
1,130 30 60 i | 3 
14. Financial 
Net Increases (+) or a om (- -) 
Currency & Deposits 580 m 
‘ + 2,500 200 1,000 200 
Liabilities + 
PRs ccccccnccccsenaac A- 
+ 1,200 300 
L+ 200 _ 1,000 400 
National Gold Account......... A > 
L+ 
Federal Obligation Acc’t........ A- 
+ 200 
L+ 2,200 
Treasury Currency Acc’t........ A re 
L+ 
Other Loans, Securities & Debts A — 900 200 
I a. uesincccctsyengacs + 100 
L+ 700 700 il 
- 200 100 
Corporate Paid-In Capital. ..... A r 
L+ 
= 200 
Valuation Adjustments 
Valuation Gains................. 500 
Valuation Losses................. 600 200 
TOTAL SOURCES.................... 69,200 8,800 124,900 56,200 10,880
TOTAL DISPOSITIONS*............. 69,700 8,800 126,100 56,300 108
Discrepancies 
Sources Not Accounted For....... 400 1,300 200 100 
Dispositions Not Accounted For. . . 






































He ne | tnd one original tabl 
ys es are compu! rom the e are tables. 
Sources and dispositions may differ because of rounding even after being adjusted for sources and dispositions not accounted for.
4 Includes withdrawals of i of interest accrued on dividends left on deposit.

5 Includes sinking and trust funds. 














TABLE I—(continued) 





STRUCTURE OF THE MONEYFLOWS ACCOUNTS 
Sources & Dispositions of Money—1939—Millions of Dollars 



















































































Sectors 
V. Feder] gsiate and X. Securities & | XI. Rest of . : 
government bal om VII. Banking | VIII. Lifeins. | IX. Other ins. | ““Lo5) estate wold Total Discrepancies? 
8S |p| s D 8 D 8 D 8 D 8 D 8 D 8 D 
ee 45,100 
aio | Mg | 4.160 580 420 240 1,060 45,100 
10 1,700 980 220 900 30 7,400 
I 560 440 1,300 160 7,400 
20 20 80 1,100 180 6,200 
240 20 130 1,300 320 6,200 
9,300 
700 9,300 
| 5,500 
1,500 40 10 10 1,100 5,500 
100 200 20 6,900 8,000 
.~ 10 50 30 30 240 8,000 
: i 360 10 1,200 3,200 185,200 
I 2,800 220 380 810 2,400 3,810 185,000 200 
7 480 os “| = 
300) By 13,700 
160 130 110 1,300 wm 13,700 
160 ” 
" 3,6408 2, 6008 6,600 
90 1204 70 380 6,600 
30 140 4,300 
80 220 2,160 1,200 4,300 
5 180 6,800 
2a | 3,240 0 0 40 6,800 
580 i) 
100 100 100 1,200 4,900 300 
5,200 5,200 
1,500 
1,500 100 \ 
3,100 3,100 
3,000 3,000 200 
on 1,400 500 100 2,200 
2,200 200 
200 200 
200 
200 
600 
1005 600 800 1,000 900 300 
m 1,200 
300 
300 
100 
500 1,340° 
100 100 40 100 1,840 
880 7,400 4,800 3,000 11,200 6.600 660° 
__| 108 {12,600 7,300 4,800 2,840 10,700 6,500 
wl | 0 2,000 
400 0 0 500 900 1,300 
Premium receipts are gross of cash [...] by policyholders. Premium payments are shown net of such dividends. Contributions
by private insurance companies to funds they maintain for their own employees have been deducted.

[...] on the thirteen national moneyflows accounts, the eleven statements of payments and balances, and the various loanfund
[...] in Moneyflows. These [...] the sources for every figure used.
Line k of Table II B shows a growth of GNP expenditures for the
entire period with the exception of the one year 1938. These figures do 
not agree with the GNP data published by the NID, but this is to be 
expected since the moneyflows data omit a host of accrual entries and 
transactions in kind. (The direction of change is nevertheless the same 
as in the NID data.) 

As a first approximation we may agree that any transactor who re- 
duces his loanfund balance in order to increase his GNP expenditures 
or reduces his GNP expenditures to increase his net financial claims is 
in some sense taking the initiative. An examination of Table II gives 
us an insight into what was happening in the period 1937-1942. In 1938 
the major reductions in expenditures are found in the household ac- 
count (line A) and the industrial corporations account (line P). An in- 
crease in expenditures by the federal government more or less offsets 
the other reductions. What the moneyflows account tells us that the 
GNP accounts do not is that households reduced their loanfund bal- 
ances by almost one-half billion in order to sustain their expenditures, 
whereas industrial corporations and business proprietors retrenched 
sufficiently to add to their loanfund balances. We have here confirma- 
tion of initiative attributed to the business world. 
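
The first-approximation criterion used in this paragraph can be stated as a small decision rule. The sketch below is one reading of it: a "reduction in loanfund balance" is taken to mean a negative net change over the year, and the function name and the sample figures are hypothetical rather than taken from Table II.

```python
def classify_initiative(change_in_gnp_expenditures, change_in_loanfund_balance):
    """Classify a sector's behavior for one year under the first-approximation rule.

    Both arguments are year-over-year or within-year net changes; the units do not
    matter as long as they are consistent.
    """
    if change_in_gnp_expenditures > 0 and change_in_loanfund_balance < 0:
        # Drew down financial claims in order to spend more.
        return "expansionary initiative"
    if change_in_gnp_expenditures < 0 and change_in_loanfund_balance > 0:
        # Cut spending in order to build up financial claims.
        return "contractionary initiative"
    return "no clear initiative"


# Hypothetical figures patterned on the 1938 description in the text.
print(classify_initiative(-1.0, -0.5))  # spending fell despite drawing on balances
print(classify_initiative(-2.0, 0.8))   # retrenched and added to balances
```

Under this reading a sector that cuts expenditures while also running down its loanfund balance, as households are described as doing in 1938, is treated as responding passively, whereas a sector that retrenches enough to add to its balance is treated as taking the initiative.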

The role of corporations in the recovery of 1939 seems to have been 
fairly passive. Decisive action can be attributed to state and local 
government (SLG) and business proprietors and partnerships (BPP). 
It should be noted that the increase in expenditures by the latter group 
is larger than that of the state and local governments, and that the 
reversal of behavior is more distinct. The BPP sector’s increase in GNP
expenditures occurred in spite of a sharp fall in the receipts recorded on 
line Z. There is a much larger element of passive responsiveness in the 
SLG activity. We may note in passing that apparently it is not only 
the sign of the excess of receipts over expenditures that is analytically
significant; the change in the magnitude of this excess requires comparable
attention.

By far the largest increase in GNP expenditures is found in the 
households (HH) account. But in this account little evidence of initi- 
ative by this sector exists. The federal government sustained its activ- 
ity and raised transfer expenditures. Its major contribution to recovery 
thus seems to have been indirect. Among the other sectors there is 
little that is unusual. 

The rest of the table confirms what has generally been known all 
along; namely, that the rapid growth of production during the early 
40’s was largely the consequence of the federal government’s war program.
It was the only sector with a declining loanfund balance.

This examination is obviously cursory. Since its main purpose is to 
supply motivation for the study of moneyflows we have not examined 
many of the ramifications of Table II. Later in the paper we shall 
supplement this by utilizing the accounts to examine the financial 
ramifications of this ebb and flow of activity. In the meantime we re- 
turn to a more careful examination of the moneyflows structure. 


Sectoring 


Copeland has chosen to use the so-called object or type of transaction 
basis for sectoring.⁹ In this type of account the entries show the types
of transactions for which funds were used and the types from which 
funds were acquired. We may note that the sectoring in Moneyflows 
substantially exhausts the economy. The greater detailing of sectors in 
the moneyflows accounts is not just a breakdown of “parent” gross 
national product sectors. The classifications appropriate to the two sets 
of accounts are different. The setting up of gross national product ac- 
counts is plagued with the difficulties inherent in trying to force our 
institutions into two way moulds, those functional sectors that produce 
the gross national product (and pay distributive shares) and those that 
use up the gross national product (and receive distributive shares), i.e., 
intermediate and ultimate sectors. In the moneyflows accounts the 
classification of a transactor is determined by his role in the credit 
markets as well as in the goods and service markets. With respect to the 
latter market the criterion seems to be more the type of product than 
its stage—ultimate or intermediate—in the production process. This 
must be so since sectoring by the latter criterion necessitates account- 
ing vivisection of transactors whereas the consideration of financial 
flows necessitates keeping transactors intact. It should be clear, further- 
more, that the introduction of such further criteria as financial activity 
will, in itself, lead to the multiplication of sectors if the degree of 
homogeneity of the sectors is not to be reduced. To a certain extent 
moneyflows sectors represent institutionally distinct entities important
to understanding how a money economic system operates. In other 
societies, with other institutions, different sectoring may be required. 
It is much easier to export gross national product sectoring than 
moneyflows sectoring.¹⁰

Again, the gross national product accounts are oriented towards 
facilitating value judgments as well as descriptions of the economy as a 





9 Ibid., p. 101.
10 However, see Dudley Seers, “The Role of National Income Estimates in Statistical Policy of an 
Underdeveloped Area,” The Review of Economic Studies, 20 (1952-3), 159-162. 
vast productive apparatus. The gross national product concept has
developed from the concept of national income. The latter, in turn, was
originally designed as a measure of national welfare.¹¹ As such it involves
judgments of what is and what is not socially useful. The accounts
are used both for measuring welfare and for analyzing economic
activity. The moneyflows approach, on the other hand, is oriented only
towards an analysis of what happens. It views the economy as a pecuniary
one with all the appropriate financial trappings as well as a
productive apparatus. As a consequence, no sectors are identified as 
final in the moneyflows accounts. 

Since the gross national product accounts consolidate transactions 
between members of the intermediate sector, it makes no difference 
whether the industrial classification is by establishment or by ownership
unit. Since the moneyflows accounts are largely combined,¹² however,
many moneyflows totals are not invariant with respect to the
transactor unit and the ownership unit is taken as basic.

On the whole, the moneyflows accounts are on a much grosser basis 
than are the gross national product accounts. With some important 
exceptions, intrasector transactions are not consolidated. In Cope- 
land’s study complete decision-making units are preserved intact inso- 
far as possible and transactions between such units are recorded. 
Among the exceptions are transactions in existing real estate, the ac- 
counts for the rest of the world, the federal government, and the bank- 
ing sector. Inadequate data necessitate using merely the net balance of 
real estate transactions.¹³ We have no choice but to consolidate when
dealing with the rest of the world. The federal government is considered 
a single transactor since it is considered a single decision-making unit. 
Thus inter-agency transactions are not recorded. And finally, the 
grounds for consolidation of the banking sector are “to bring out clearly 
the relation between banks and U. S. Monetary funds and the rest of 
the economy.”¹⁴

To a certain extent these sectors are arbitrary. The actual set of 
sectors with which we finally deal is a compromise between the analyti- 





11 See for example Pigou’s use of this concept in the Economics of Welfare, 4th Edition, London, 
MacMillan 1932. 

12 Copeland, op. cit., p. 125. A consolidated statement can be defined as “A statement showing 
[the]... condition or operating results of two or more associated enterprises as they would appear if 
they were one organization. The preparation of a consolidated statement involves eliminations of inter- 
company accounts, investments, advances, sales, and other items.” E. L. Kohler, A Dictionary for Ac- 
countants, Prentice-Hall, New York, 1952, p. 98. 

13 Copeland, op. cit., p. 87. Gross figures are used in the extension of the moneyflows study now
being conducted by the research staff of the Board of Governors of the Federal Reserve System. See 
D. H. Brill, et al., Progress Report on the Moneyflows Study, Washington, 1951. (Mimeo.) 

14 Copeland, op. cit., p. 283. 
cally desirable and the statistically possible. We find ourselves dealing
with the eleven sectors whose abbreviated titles head the columns of
Table I. To a certain extent Copeland’s classification follows the lines
of the Standard Industrial Classification.¹⁵

Differences in sectoring principles make direct comparison of sectors 
rather difficult. When the sectoring is by the activity principle, the 
same transactor may be found in more than one account. In the gross 
national product Consolidated Business Income and Product Account 
we find some, but not all, transactions by members of several money- 
flows sectors. For example, we find that income generating, indirect tax, 
etc., expenditures, and capital consumption allowances of industrial 
corporations (Sector III) are included in the above gross national prod- 
uct account. On the other hand, capital expenditures by this same 
moneyflows sector are entered in the Gross Savings and Investment 
Account by the Department of Commerce. However, in the case of 
business proprietors, etc., (Sector IV), the corresponding current ex- 
penditures by some members, such as the private non-profit organiza- 
tions, are treated by the Department of Commerce as expenditures of 
the personal sector. The only moneyflows sectors in which all the mem- 
bers play a role in the Consolidated Business Income and Product Ac- 
count are farms (Sector II), industrial corporations (Sector III), and 
the insurance carriers (Sectors VIII and IX). Thus government cor- 
porations whose income and product transactions are included in the 
NID account are wholly transferred to the appropriate moneyflows 
government sector. 


Transaction Types 


Copeland has set up fourteen types of transactions (the first fourteen 
row captions of Table I). The first thirteen types of transactions are 
collectively labelled ordinary transactions. Changes in loanfund balances¹⁶
are shown on the table in detail. If the details are combined
into a single net total, the result is what Copeland calls “money ob- 
tained through financing or net money advanced or returned to others.” 
For each transactor this equals the difference between his ordinary dis- 
positions and sources of money except for statistical discrepancies. 
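
The relation stated in the last sentence can be written out as an identity. The symbols below are assumptions introduced here for illustration and are not Copeland's notation; the sign attached to the discrepancy term is likewise a convention chosen for this sketch.

```latex
% S_i^{o}: ordinary sources of money of transactor i
% D_i^{o}: ordinary dispositions of money of transactor i
% F_i:     money obtained through financing
%          (a negative value means net money advanced or returned to others)
% e_i:     statistical discrepancy (sources not accounted for)
\begin{align}
  S_i^{o} + F_i + e_i &= D_i^{o} \\
  F_i &= D_i^{o} - S_i^{o} - e_i
\end{align}
```

When the discrepancy is zero, a transactor whose ordinary dispositions exceed his ordinary sources must have obtained the difference through financing.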

Generally speaking the titles of these transaction types are self ex- 
planatory. Among those that are not so self evident is customer money- 





15 Government Printing Office, Vol. 1, 1941, Vol. 2, 1942. Copeland, op. cit., p. 37.

16 Loanfund balances are the totality of financial claims owned and owed by a transactor or sector. 
They include currency and deposits, book credit accounts, securities, mortgages, and in the case of 
monetary authorities monetary metals. They do not include insurance policy reserves and other strictly 
accrual items. 
flows (item 7). This type of transaction includes the purchases and sales
of all non-financial goods and services not specifically included in the 
other transaction types. It thus includes purchases and sales of second 
hand goods and intermediate goods as well as final goods. In the latter 
are included many capital goods.¹⁷

A second transaction type, the nature of which is not immediately 
obvious, is net owner takeouts. This is defined by Copeland as “cash 
withdrawals from proprietorship account by the owners of unincorpo- 
rated businesses and farms and the lessors of real estate minus new 
money invested by these owners.”¹⁸ The households are the only recipients
of this flow. As can be seen, the flow originates in three sectors.
Logically, the computation of this account is equivalent to (1) combin- 
ing the NID’s income of unincorporated business and rental income of 
persons, (2) subtracting the components of these which are not con- 
sidered moneyflows to households such as various accrual and imputed 
items, and (3) adding certain moneyflows to households not already 
included such as, for example, cash withdrawals of a proprietor in ex- 
cess of the income earned by the business. The reconciliation of the 
moneyflows type of transaction with the two NID types is obviously 
anything but simple. 
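
The three logical steps just listed can be sketched as a simple computation. The function name, the item labels, and all figures (in millions of dollars) are hypothetical; the sketch shows only the arithmetic form of the steps, not the actual statistical reconciliation, which the text notes is far from simple.

```python
def net_owner_takeouts(unincorporated_income, rental_income,
                       non_moneyflow_items, additional_moneyflows):
    """Follow the three logical steps described in the text.

    non_moneyflow_items and additional_moneyflows map an item label to an amount.
    """
    # Step 1: combine the two NID income types.
    combined = unincorporated_income + rental_income
    # Step 2: subtract components that are not moneyflows to households
    # (accrual and imputed items).
    cash_basis = combined - sum(non_moneyflow_items.values())
    # Step 3: add moneyflows to households not already included.
    return cash_basis + sum(additional_moneyflows.values())


# Hypothetical illustration, millions of dollars.
takeouts = net_owner_takeouts(
    unincorporated_income=10_000,
    rental_income=3_000,
    non_moneyflow_items={"imputed rent": 1_500, "accrual items": 500},
    additional_moneyflows={"withdrawals in excess of income": 300},
)
print(takeouts)  # 11300
```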

The final transaction type that calls for mention is public purpose 
payments. Generally speaking these are transfer items. Logically they 
should include gifts by households to households, but for fairly obvious 
practical reasons, the estimates do not. Logically, this transaction type 
should include all international transfers, but actually it includes only 
personal remittances abroad. 

To some extent, the inclusion or exclusion of particular transactions 
in the moneyflows accounts is predetermined by the condition that
the accounts must balance and the prior decision to include or exclude 
other transactions. The introduction of net changes in the loanfund 
balances as a sort of balancing item in the account has important 
ramifications with respect to what can and can not be excluded. The 
counterflow of any flow which affects these net changes must necessarily 
be included or the balance will be destroyed. Thus, for instance, cash 
interest received must be included regardless of its economic signifi- 
cance. The exclusion of this item would make the difference between 
ordinary receipts and expenditures less than net changes in the loan- 





17 See below. Note also that construction is included in installments to contractors rather than in 
customer moneyflows. Note too that in the case of second-hand goods, only half the purchase cost is in- 
cluded. 

18 Copeland, op. cit., p. 17. 
fund balance which this difference is supposed to equal. On the other
hand the sole consideration in the case of money-changer transactions 
is their significance, or rather in this case, their lack of significance. 
These transactions are entirely financial and affect both sides of each 
account equally. While for many purposes any transaction which 
merely affects the composition of the loanfund balances without affect- 
ing their size can be excluded, the information would be essential for 
most analysis of monetary and credit flows. 

Moneyflows are neither on a strictly cash nor accrual basis.¹⁹ Wages
and taxes are on a cash rather than accrual basis.²⁰ But credit and installment
purchases are recorded as well as cash purchases. However,
since the system of accounts embraces only transactions between dis- 
tinct parties and only transactions that involve the use of cash or other 
financial claims, neither imputed income, income in kind, nor internal 
transactions such as depreciation appear.²¹

The inclusion of various credit transactions involves the underlying 
and important institutional fact that in a good many transactions other 
financial claims are used instead of money. Trade credit is one such 
substitute. A further consequence is that offset settlements are kept as 
distinct transactions.²²

The various factors noted above contribute to the differentiation of 
the moneyflows and NID accounts. But perhaps one of the most strik- 
ing differences between these two sets of accounts is the failure to 
identify capital formation items in the former. A glance at Table I will 
show this is so. This difference is not the result of inherent differences 
in the two sets of accounts. With the expansion of the number of sectors 
in the moneyflows accounts, it was statistically impossible, with the 
data available at the time, to identify sales that go on capital accounts. 

After adjustment for the imputation and accrual items, the relationship
between the moneyflows accounts and the gross national product
accounts can be displayed and tables such as Table II set up. However,
the resulting accounts are deceptively simple. While the conceptual
relationship between the two systems is fairly straightforward, an actual
statistical reconciliation of the two sets of accounts is exceedingly
complex and difficult.





19 P. 19.

20 See S. J. Sigel, “A Comparison of the Structure of Three Social Accounting Systems,” Confer- 
ence on Research in Income and Wealth, October 1952, N.B.E.R. (mimeo). Mr. Sigel’s remarks are 
directed at the revised system used at the Federal Reserve Board but most of his remarks apply as well 
to Moneyflows. 

21 Copeland, op. cit., p. 69.

22 Copeland, op. cit., p. 69.








86 AMERICAN STATISTICAL ASSOCIATION JOURNAL, MARCH 1955


The Structure 


When a transactor participates in a main money circuit transaction,
four simultaneous entries are recorded. One pair of entries records the
dispositions and sources of money of one transactor. The other pair
records the dispositions and sources of the other party to the transaction.
On Table I we can see what would have been the accounting effect
if households had purchased an additional $100 million worth of goods
from, for example, corner groceries, on a credit basis. In the household
account (Sector I) dispositions of money on account of customer
moneyflows (item 7) would read $51.3 billion instead of $51.2 billion as
it now does. In the same sector account this change would be balanced
by a change in book credit (L..) (increased Accounts Payable) from $.2 
billion to $.3 billion. Similarly in the account of Sector IV, to which 
corner groceries belong, sources on account of customer moneyflows 
would read $50.5 billion instead of $50.4 and book credit (A,) (in- 
creased Accounts Receivable) would read $.4 billion instead of $.3 bil- 
lion. It can be seen easily that this procedure leaves the accounts of 
both sectors in balance and at the same time leaves the rows of both 
customer moneyflows and book credit in balance. 
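
The four entries in this example can be traced in a short sketch. The dictionary keys and the function name are illustrative labels chosen here, not Copeland's; only the dollar figures come from the text above, restated in units of $100 million so that the arithmetic stays in whole numbers.

```python
# Units: hundreds of millions of dollars, so $51.2 billion is written 512.
# Key names are illustrative assumptions, not Copeland's terminology.
households = {"customer_moneyflow_dispositions": 512, "book_credit_payables": 2}
sector_iv = {"customer_moneyflow_sources": 504, "book_credit_receivables": 3}


def post_credit_purchase(buyer, seller, amount):
    """Post the four simultaneous entries for goods bought on book credit."""
    buyer["customer_moneyflow_dispositions"] += amount  # buyer: disposition of money
    buyer["book_credit_payables"] += amount             # buyer: balancing source
    seller["customer_moneyflow_sources"] += amount      # seller: source of money
    seller["book_credit_receivables"] += amount         # seller: balancing disposition


# An additional $100 million of credit purchases from corner groceries.
post_credit_purchase(households, sector_iv, 1)
```

Each sector account changes by the same amount on both sides, and each transaction row (customer moneyflows, book credit) gains one source and one disposition of equal size, which is why both the columns and the rows stay in balance.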

In the table, by examining the rows, we can see to which sectors the 
transactions have been sources and to which they have been disposi- 
tions of money, and of course the magnitude of the dispositions and 
sources of money on the account of the transactions. Similarly, by 
examining the columns we can see the transactions of any particular 
sector. If the financial section, the portion of the table showing net in- 
creases or decreases in financial assets and liabilities, had been in terms 
of absolute values instead of changes, we would have had what Cope- 
land calls statements of payments and balances. A separate computa- 
tion of loanfund financing, i.e., net changes in balances of financial 
assets and debts, would be necessary to complete the statement. 

“Net money obtained through financing” and “net money advanced 
or returned to others” can be estimated either by balancing the ordi- 
nary receipts and expenditures or from the changes in the loanfund 
balances. The two estimates should be equal except for statistical dis- 
crepancies and deviations from accounting uniformity. If we recapitu- 
late the funds obtained and advanced, we can see which sectors ad- 
vanced money and which sectors obtained it. This is shown in Table 
III which Copeland labels “Net Money Obtained or Advanced a/c 
Loanfund Financing.” 
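
As a rough check on this relationship, the sketch below subtracts the total of net money obtained from the total of net money advanced or returned, year by year, using the rounded totals printed in Table III. The variable names are illustrative only.

```python
# Rounded totals as printed in Table III (millions of dollars).
years = [1936, 1937, 1938, 1939, 1940, 1941, 1942]
total_obtained = [6200, 2500, 2800, 4100, 4500, 11500, 41200]
total_advanced = [6500, 2600, 3500, 5400, 5600, 13300, 41200]

# Discrepancy = money advanced or returned minus money obtained.
discrepancy = {
    year: advanced - obtained
    for year, obtained, advanced in zip(years, total_obtained, total_advanced)
}
```

For 1938 the difference of the rounded totals is 700, whereas the discrepancy row of the table prints 800; this is presumably the rounding effect that the note to the table warns about.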

Table I shows the moneyflows for only one year. We can see the
matrix of mutual constraints that exist in the economy by examining
the table in its totality. Neither the sources nor the dispositions of
money of any sector can change without resulting in a corresponding

TABLE III

NET MONEY OBTAINED OR ADVANCED a/c LOANFUND FINANCING
(Millions of Dollars)

                                                      1936    1937    1938    1939    1940    1941    1942

Net Money Obtained By:
A  [Households]                                          0       0     400       0     100       0       0
B  [Farms]                                               0     300     100       0     200       0       0
C  Industrial Corporations                               0   1,100       0       0       0       0       0
D  Business Proprietors and Partnerships et al.          0      0¹       0     300       0       0       0
E  The Federal Government                            5,050     300   1,350   2,200   2,400  10,050  41,150
F  State and Local Governments                           0       0       0     400       0       0       0
H  Security and Realty Firms et al.                  1,200     800       0     400     200     300       0
J  The Rest of the World                                 0      0¹   1,000     700   1,500   1,100       0
   Total                                             6,200   2,500   2,800   4,100   4,500  11,500  41,200

Net Money Advanced or Returned By:
L  [Households]                                      2,300     100       0     300       0   5,000  19,600
M  [Farms]                                             300       0       0     600       0     400   1,800
N  Industrial Corporations                             300       0     200   2,200   1,600   3,600  10,200
P  Business Proprietors and Partnerships et al.        700       0     800       0     500     300   4,300
Q  State and Local Governments                         300     200     300      0¹     400     800   1,200
R  Banks and U.S. Monetary Funds                       400     500     300     400     800     500     200
S  Life Insurance Companies                          1,500   1,400   1,600   1,600   1,700   2,200   3,400
T  Other Insurance Carriers                            400     400     200     300     600     400     600
U  Security and Realty Firms et al.                      0       0     100       0       0       0     800
W  The Rest of the World                               200       0       0       0       0       0     200
   Total                                             6,500   2,600   3,500   5,400   5,600  13,300  41,200

Discrepancy (Money Advanced or Returned                300     100     800   1,300   1,100   1,800      0²
   minus Money Obtained)

1 Less than $50 million.

2 Lies between ± $50 million.

Note: Due to rounding, columns may not total precisely.

Source: The sources of the individual items appear in a final column in the original table, but since they are in code
and refer to the extensive appendix of Moneyflows, they are not reproduced here.

Reproduced with permission from M. A. Copeland, A Study of Moneyflows in the United States, National Bureau
of Economic Research, New York, 1952.


change in the account of some other sector. In other words, no sector can
change its expenditures, receipts, or composition of its loanfund balance
without affecting one or more other sectors.

For many purposes it is better to arrange the data, as Copeland has,
in separate tables for each sector, transaction, and loanfund balance
and include the data for every year for which they are available.²³ It is
then possible to get a historical picture of moneyflows, as in Tables II
and III. For most analytic purposes the time series variety of tables is
to be preferred. For behavior analysis the structure form gives only a 
partial picture. The greatest advantage of Table I lies in demonstrating 
the mutual interdependence of our society. 

By examining the financial sections of Copeland’s statements of payments
and balances (which we do not reproduce here for lack of space)
we can complement the picture of economic activity portrayed above
with an analysis of its financial ramifications. We shall not attempt to
be anything but sketchy. We noted that in 1938 households reduced and
industrial corporations increased their loanfund balances. What
implications did this have for the volume and distribution of negotiable
claims? We may note first of all that, of a fairly sizeable increase in
currency and deposits outstanding, households received relatively little and
industrial corporations a great deal. However, the gain of $.9 billion in
currency and deposits by the latter turned out to be far greater than
their net money obtained through financing and is reflected in the
liquidation of accounts receivable and an increase in paid-in capital. We find
a shuffling from government to other securities and an increase in
interest-bearing debt. While we find that most sectors shared in the
increase in currency and deposits, the increase in federal securities was
largely picked up by the banking and life insurance sectors. Interestingly
enough, this did not mean that the life insurance companies reduced
their holdings of currency and deposits. Quite the contrary, these
holdings were increased. The ordinary course of events yielded them a
sufficient excess of receipts over expenditures to make a substantial
purchase of other loans and securities as well.

Of interest also is the significant change in the distribution of federal
securities in the early 1940’s. Without exception all sectors purchased
securities. Banks continued to hold about one-half of federal obligations.
The share held (a) by households fell from about 1/4 to
about 1/5 and (b) by life insurance companies fell from about 1/9 in
1940 to about 1/13 in 1942. On the other hand, industrial corporations’
holdings rose from about 1/45 to about 1/12.





23 The tables in the Progress Report are roughly similar to those in Moneyflows, in that they are
both of the time series variety. However, the two sets of tables do not have uniform categories. Copeland
does not present a detailed comprehensive picture of his accounts as is found in Table I. However, the
structure of the Federal Reserve Board moneyflows accounts for 1947 has been prepared and used as
part of a paper on the moneyflows study given by D. H. Brill to the Econometric Society Meetings in
December 1951, and the format of Table I is an adaptation of this to Copeland's accounts.

Again, in Table III we can see the considerable increase in loanfund
balances that had already taken place by the end of the first year of 
World War II. We could go further and examine the composition of 
these changes in fascinating detail but again the purposes of this ex- 
cursion are to supply illustrations of the types of analysis moneyflows 
makes possible and, not least, motivation. Analysis, as such, of the period 
is beyond the scope of this paper. 


Discrepancies 

The discrepancies disclosed in Table I may appear to be too large for 
comfort. To a large extent this is due to the data available rather than 
to deficiencies in the system itself. But some of the discrepancies are 
inherent in the accounts. There are three major sources of non-uni- 
formity in the accounts that contribute to these discrepancies. They 
are non-uniformity in timing, valuation, and classification. 

First, for example, non-uniformity in timing results from the fact 
that the debtor’s record of payment precedes in time the creditor’s 
record of receipts. It takes time for payments to come through the 
mails. This discrepancy is also reflected in the national currency and 
deposit account so that the deposit holdings of the payee are under- 
stated and the banking sector reports more liabilities than there are 
apparent charges against it in the currency and deposit account. 

A second mail float, which Copeland fails to note,²⁵ tends to make
receivables exceed payables. The float arises from the time lag involved
in mailing invoices to the purchasers of goods and services. The
discrepancy will also be reflected in some other account. If, for instance,
the transaction involves a customer moneyflow, the dispositions in this
account will tend to lag behind the sources.

Lack of uniformity in valuation arises from different accounting 
practices which often lead debtors and creditors to value the same items 
differently. In the book credit account, receivables are listed as net and 
payables as gross. In other words, the creditors’ estimates of uncol- 
lectable accounts are already written off, but no debtor has reason to 
believe that this is the case with his account. It is such items as these 
that make up the valuation gains and losses at the bottom of the table. 
They also give rise to a tendency for estimates of payables to exceed 
receivables offsetting to some extent the opposite tendency resulting 
from non-uniformity in timing. They introduce difficulties in the

24 Copeland, op. cit., p. 234.

25 There is reason to believe that Copeland ignored this mail float because of the inadequacy of the
data.

26 Copeland, op. cit., p. 153.

computation of financial flows. The financial flows recorded in moneyflows
accounts are composed only of those changes in the financial assets and 
liabilities that occur as a result of market transactions. Write-ups and 
write-downs must be removed from book increments and decrements 
to show those changes actually resulting from market transactions. 
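
The adjustment described here amounts to a single subtraction, sketched below. The function name, its parameters, and the figures in the usage line are hypothetical and serve only to illustrate the arithmetic; they are not drawn from Copeland's accounts.

```python
def market_transaction_flow(opening_book, closing_book, write_ups=0, write_downs=0):
    """Change in a financial asset or liability attributable to market transactions.

    The book increment is cleared of revaluations: write-ups raise book value
    without any transaction, so they are subtracted; write-downs are added back.
    """
    book_change = closing_book - opening_book
    return book_change - write_ups + write_downs


# Hypothetical figures: book value rises from 1,000 to 1,150, of which 50 is a
# write-up, while 20 of uncollectable accounts were written down during the year.
flow = market_transaction_flow(1000, 1150, write_ups=50, write_downs=20)
```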

Finally, lack of uniformity in classification is found in the case of the
purchase of gold from domestic producers by the banking sector. The
domestic producers treat gold as a commodity and consequently credit
customer moneyflows. On the other hand, the banks treat gold as a
loanfund item and record no corresponding debit to customer
moneyflows.²⁷ Similarly, in the sale of securities by brokers to households, the
households often treat the transaction as affecting only the composition
of their loanfund balance, while the brokers take cognizance of the
service charges involved.

The examples of discrepancies we have noted are not exhaustive of 
all the discrepancies that can and do occur. The discrepancies that do 
exist and the inadequacies of the data combine to introduce an element 
of imprecision in the moneyflows accounts. Fortunately, however, 
many of the discrepancies balance against each other. 


Prospects for Moneyflows 


Whatever defects may be found in the accounts, Copeland’s Money- 
flows stands as a significant contribution to economics. The value of the 
framework does not depend solely upon the validity or lack of validity 
of the discretionary hypothesis developed from it by Copeland. More 
than one model is compatible with the moneyflows framework in the 
same way as more than one model is compatible with the gross national 
product framework. One of the great advantages of the NID accounts 
is the ordering they make possible of the many series of data which we 
have had for a long time, and not incidentally, for the gaps in our 
statistical knowledge that they disclose. The moneyflows accounts do 
the same for other series of data. They enable us to bring together into 
an integrated picture the financial and nonfinancial activities of the 
various major sectors. 

*The financial facts in the moneyflows accounts are financial market 
facts. They record sales less purchases of stocks and bonds, of govern- 
ment securities, and other negotiable instruments not separately re- 
corded in the accounts. As we have seen they can tell us who liquidated 





27 Ibid., p. 87.

* The following three paragraphs were written by Prof. J. C. Dawson of the University of Mary- 
land. They appear substantially in their original form. However, I have made a few changes and the 
responsibility for any error is, therefore, mine. 



















A STRUCTURE OF MONEYFLOWS 91 


portfolios, who increased their debts, who made more use of trade 
credit, who borrowed from banks, and who drew down their cash bal- 
ances. We cannot find these financial market phenomena in the NID 
Gross Savings and Investment Account. The latter does not tell us what 
financial forms personal saving takes, or what financial means were 
used by business or government to obtain funds. It does not disclose the 
linkage between the two through our financial institutions. 

Lest the reader find this unnecessarily vague let him attempt answers 
to the following by examining the Commerce accounts: How much 
money does non-corporate business obtain from the financial markets 
and in what forms? What impact would choking off bank credit have 
on outside corporate financing? What sectors do insurance policy re- 
serves finance? How do we relate purchases of new houses and other 
durable consumer goods to the means of financing them, e.g., the avail- 
ability of consumer credit? It will be recalled that increased corporate 
security holdings by households, increments in insurance policy re- 
serves, and purchases of new homes by prospective owner occupants 
are all components of personal saving—though not separately itemized 
in the Commerce residual estimate. It may be said that the gross sav- 
ings and investment account was not designed to answer such questions. 
Yet much current economic analysis relating finance to production— 
analysis that ought to concern itself with questions such as these—is 
done in terms of this account. 

While some of the financial market transactions listed above are not 
separately identified even in the moneyflows account, most of them can 
be revealed by the various sector statements of payments and balances. 
And via this account each sector’s financial operations can be related to 
its operations in the production area, enabling a synthesis of these two 
major areas in over-all economic adjustment. But part of the advantage 
is not that moneyflows accounts are sufficiently detailed for analysis, 
but that they are capable of being expanded further to show further 
details. 

Thus the usefulness of the accounts can hardly be questioned. We 
cannot and will not attempt to catalogue further the myriad of applica- 
tions that are possible. We may note that among other things, they 
provide us with an X-Ray into the fabric of the veil of money. The 
confusion of the flows of goods, services, and resources with flows of 
money has led many an economist into errors that can be avoided by 
the utilization of the moneyflows framework or some variant thereof. 
The questions propounded by Copeland, some of which we have noted 
earlier, and his answers to them are impressive.

Furthermore, the accounts provide us with a framework with which
we may be able to elaborate, or even reconstruct, existing theory, and 
mould theory into a more powerful instrument for analyzing current 
activity and providing greater insight into subsequent events. In the 
account of each sector there are possibilities of finding expenditure 
functions. Until the interrelationships between the various items of the 
accounts have been carefully examined, there exists the possibility that 
the mine of information that may exist in these accounts has not been 
exhausted. 

A final word about the use of the word structure in this paper is in 
order. As has been noted, a structure of moneyflows is not exactly 
analogous to a structure of input-output relationships. The data that 
appear in an input-output table are the results of economic organisms 
operating within the confines of technological or engineering relation- 
ships. It is believed that the ratios of the values in the cells of the latter 
table are free to move only within narrowly confined limits. As a conse- 
quence, the table for one bench mark year supplies us with a consider- 
able amount of information about these relationships. The same is not 
true in the case of moneyflows. Nor, incidentally, is it true of the gross 
national product accounts. In the case of moneyflows, the ratios of the 
values in the cells are determined, to some extent at least, by behavior 
relationships. These, as we have discovered to our chagrin, are no- 
toriously hard to establish. It is a rash economist indeed that propounds 
a behavior relationship on the basis of data for one year. Even with 
data for seven years, Copeland was justifiably hesitant about pro- 
pounding the existence of specific relationships. Historical series, and 
long historical series, are a necessity for the investigation of such rela- 
tionships. Furthermore, whereas engineers can supply us with informa- 
tion about changes in technological relationships, we are unlikely to 
discover analogous changes in behavior relationships as easily. Conse- 
quently, there is a sense in which it seems that input-output studies are 
more efficient, though not necessarily more useful. Any such difference 
in efficiency, however, is not inherent in the nature of the two ap- 
proaches, but rather in the subject matter with which they deal, and 
in the light of the difficulties that are now being encountered in carrying 
the input-output studies further, it seems that the difference in effi- 
ciency can easily be overrated. 























THE MEASUREMENT OF SEASONAL MOVEMENTS IN 
PRICE AND QUANTITY INDEXES 


BRUCE D. MUDGETT
Brown University 


THE problem of constructing a price index (or quantity index) that
will correctly measure price or quantity changes over the seasons
has occupied index number makers for many years, but no solution of 
this problem that has yet been proposed has been very satisfactory. 

One method used has been to weight the index with weights ap- 
propriate to the yearly importance of the individual commodities and 
then to neglect variations in importance over the year. For commodities 
that disappear wholly from the market at certain seasons of the year 
the index has been carried forward over this period by retaining the last 
quoted price until a new quotation appears in the market. This is a 
typical procedure though there have been variations in detail. It is not 
difficult to see that this is no solution of the problem at all; that, in- 
stead of measuring the actual price variations over the seasons the in- 
dex number maker has introduced quotations that have nothing to do 
with particular seasons at all. Consider an index, for example, that is 
published at monthly intervals and suppose that the weights for all 
commodities in the index are annual weights, that is, they are measures 
of the yearly, not monthly, importance of each commodity. Consider 
now the February index, on the not unrealistic assumption that there 
has been no price quotation for a given commodity since the previous 
November. The position taken here is that a monthly index for Febru- 
ary constructed by including this commodity with a price belonging to 
the previous November and a weight measuring annual and not 
monthly importance, gives a resulting index with an element, so far as 
this particular commodity is concerned, that is wholly unrealistic. If 
indexes are measures, more or less exact to be sure, of actual price 
situations through which an economy has passed, then the procedure 
described deserves a rather solid condemnation as soon as an improved 
procedure can be devised. 

There is described herewith a procedure for measuring seasonal price
influences¹ that the writer believes to possess enough merit to justify
publication and an invitation for critical appraisal by other
statisticians.²





1 The procedure will be developed in terms of price indexes, but it is directly applicable to quantity
indexes with nothing more than the interchange of the p's and the q's of ordinary index number notation.

2 A word needs to be said as to the origin of this solution. Over the period of the revision of the
indexes of wholesale and consumer prices of the United States Bureau of Labor Statistics, November 1949
to February 1953, many of the discussions which took place between the Bureau and the Technical
Advisory Committee of the American Statistical Association were devoted to the discussion of the
problem of seasonal measurement. During these years several memoranda on the subject were prepared,
significantly those by Doris Rothwell of the Bureau staff and by Abner Hurwitz, Chief of the Cost of
Living Section of the Bureau staff. The development of the present writer's views was influenced in an
important way by these memoranda and by the discussions of them. The writer submitted one
memorandum which is the substance of what is presented here and wishes to say that while he is the sole
author of what follows and takes full responsibility for it, it is nevertheless true that his own views on
seasonal measurement were developed and his understanding of the subject greatly clarified by these
discussions.

Consider the concept of a yearly price level as measured by the
traditional value aggregate $\sum p_0 q_a$ familiar in modern index number
notation, where the summation is over the commodity list upon which the
index is based and where the weights, $q$, are appropriate measures of the
importance of the prices, $p$. The base period prices here are the $p_0$’s and
the weights the $q_a$’s. No question need be raised about the
appropriateness of the $q_a$’s, since they are completely general at this point and
therefore any index number maker may give them specific content with
respect to the way he defines appropriateness of weights. The above
aggregative for the year 0 is built up by the continuous summation of
value items over the year. It can therefore be broken down into
sub-aggregates, by months for example, if months are the smallest time
intervals for which indexes are wanted. If the year’s aggregative
represented a year’s family expenditures based upon the annual consumption
quantities, $q_a$, then its monthly components would measure the spread
of the yearly total over the seasons.

A realistic measurement of price changes over the seasons can be
built upon this conception. The proposed index will first be designed
using a fixed weight aggregative formula, but it will be shown later that
the principle on which the measurement is based is independent of this
formula and is equally applicable to any of the theoretically best
formulas. This generality is a matter of prime significance. In order accurately
to describe this index an exact notation is required through which
various needed aggregates are given meaningful and precise definition.
This notation and the appropriate aggregates follow at once.

Notation:

(1) $p$, $q$ = prices, quantities of individual commodities.
(2) $P$, $Q$ = indexes of prices, quantities.
(3) Superscripts, $t$, represent commodities. They will be omitted
where, in the context, they are not needed.
(4) Subscripts, $ij$, represent:
years $i = 0, 1, 2, \cdots$.
months $j = 1, 2, \cdots, 12$.
In general
$0$ = base year
$i$ = a given year
$ij$ = a given month of a given year.
(5) Number of commodities, $t$.
$N_{a \cdot j}$ = number of commodities in list for month $j$ (of the year $a$).
$N_a$ = number of commodities in the list for the year $a$ (some
weight period),

\[ N_a = \sum_{j=1}^{12} N_{a \cdot j} \ \text{minus duplications.} \]

Fixed monthly weights for commodity $t$:

$q_j^{(t)}$ = quantity of commodity $t$ for month $j$.
$t = 1, 2, \cdots, N_a$.
$j = 1, 2, \cdots, 12$.
(In general, this magnitude is not the actual quantity for a
given month obtained in the market for that month, but the
fixed quantity used as weight.)

\[ \sum_{j=1}^{12} q_j^{(t)} = q_a^{(t)} = \text{annual quantity (weight) of commodity } t. \]

Monthly average price, $p_{ij}^{(t)}$, and yearly average price, $p_i^{(t)}$, of
commodity $t$ for year $i$.

Then, for $i = 0$,

\[ \frac{\sum_{j=1}^{12} p_{0j} q_j}{\sum_{j=1}^{12} q_j} = p_0 = \text{base year average price of commodity } t. \]

And since

\[ \sum_{j=1}^{12} q_j = q_a, \]

therefore

\[ \sum_{j=1}^{12} p_{0j} q_j = p_0 q_a. \]

Similarly for any $i$.
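
A small numerical check of this identity for a single seasonal commodity may help. The monthly weights and prices below are invented for illustration, and months in which the commodity is off the market carry a zero weight and no price.

```python
from fractions import Fraction

# Hypothetical base-year data for one seasonal commodity (index 0 = January).
# None marks a month with no quotation; its weight is zero, so it is skipped.
monthly_weight = [0, 0, 0, 0, 5, 20, 40, 40, 20, 5, 0, 0]
monthly_price = [None, None, None, None, 12, 10, 8, 8, 9, 11, None, None]

# q_a: the annual weight is the sum of the fixed monthly weights.
annual_weight = sum(monthly_weight)

# Sum over months of p_0j * q_j, taken only over months with a nonzero weight.
annual_value = sum(p * q for p, q in zip(monthly_price, monthly_weight) if q)

# p_0: the base-year average price is the weighted mean of the monthly prices.
average_price = Fraction(annual_value, annual_weight)
```

With these figures the annual weight is 130 and the annual value 1,135, so the average price is 1,135/130, and multiplying it back by the annual weight recovers the annual value exactly.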
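Monthly and yearly aggregates and index numbers

Base year, 0

\[
\begin{array}{lll}
\text{Month} & \text{Monthly aggregates} & \text{Cumulative monthly aggregates} \\[4pt]
\text{Jan.} & \sum_{t=1}^{N_{a \cdot 1}} p_{01} q_1 & \sum_{t=1}^{N_{a \cdot 1}} p_{01} q_1 = A_{01} \\[4pt]
\text{Febr.} & \sum_{t=1}^{N_{a \cdot 2}} p_{02} q_2 & \sum_{j=1}^{2} \sum_{t=1}^{N_{a \cdot j}} p_{0j} q_j = A_{02} \\[4pt]
\text{Dec.} & \sum_{t=1}^{N_{a \cdot 12}} p_{0 \cdot 12}\, q_{12} & \sum_{j=1}^{12} \sum_{t=1}^{N_{a \cdot j}} p_{0j} q_j = A_{0 \cdot 12} = A_0 = \sum_{t=1}^{N_a} p_0 q_a \\[4pt]
\text{Year} & & \sum_{j=1}^{12} \sum_{t=1}^{N_{a \cdot j}} p_{0j} q_j = \sum_{t=1}^{N_a} p_0 q_a = A_0
\end{array}
\]

Given year, $i$ [$i = 0, 1, 2, \cdots$]

\[
\begin{array}{llll}
\text{Month} & \text{Monthly aggregates} & \text{Cumulative monthly aggregates} & \text{Index numbers} \\[4pt]
\text{Jan.} & \sum_{t=1}^{N_{a \cdot 1}} p_{i1} q_1 & \sum_{t=1}^{N_{a \cdot 1}} p_{i1} q_1 = A_{i1} & P_{0 \cdot i1} = \dfrac{A_{i1}}{A_{01}} \\[8pt]
\text{Febr.} & \sum_{t=1}^{N_{a \cdot 2}} p_{i2} q_2 & \sum_{j=1}^{2} \sum_{t=1}^{N_{a \cdot j}} p_{ij} q_j = A_{i2} & P_{0 \cdot i2} = \dfrac{A_{i2}}{A_{02}} \\[8pt]
\text{Dec.} & \sum_{t=1}^{N_{a \cdot 12}} p_{i \cdot 12}\, q_{12} & \sum_{j=1}^{12} \sum_{t=1}^{N_{a \cdot j}} p_{ij} q_j = A_{i \cdot 12} = A_i & P_{0 \cdot i(12)} = \dfrac{A_i}{A_0} \\[8pt]
\text{Year} & & \sum_{j=1}^{12} \sum_{t=1}^{N_{a \cdot j}} p_{ij} q_j = \sum_{t=1}^{N_a} p_i q_a = A_i & P_{0i} = \dfrac{A_i}{A_0}
\end{array}
\]
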

From the index numbers given above for the year $i$ it will be seen
that a full set of all monthly indexes is available, as well as all yearly
indexes. Each monthly index, however, measures the cumulative price
change of the given year through the month in question. That is, the
February index measures the cumulative effect of price changes through
January and February. Similarly for the other months. Several im- 
portant properties of this index may be pointed out: 


(1) The basic price index is the yearly index.

(2) The monthly indexes within any year measure the change in the
cumulating influence of the seasons. The December index for
any year is identical with the year index.

(3) The price index for any month of any year upon any other month
of that or another year may be obtained by direct division of the
two indexes. Thus the July $i$ index on the June $i$ base becomes

\[ P_{0 \cdot i7} \div P_{0 \cdot i6}. \]

(4) The above procedure is independent of the formula used, in the
sense indicated herewith:

(a) As used above, the price change is measured by the fixed
weight aggregative formula.

(b) The procedure will give a measurement of price changes by
the Laspeyres or Paasche formula, merely by replacing
weights $q_j^{(t)}$ by $q_{0j}^{(t)}$ (Laspeyres) or by $q_{ij}^{(t)}$ (Paasche). Then
any cross between these two is easily obtained. The
Marshall-Edgeworth formula is obtained by using $(q_{0j}^{(t)} + q_{ij}^{(t)})$
as weights.

(c) The question of using a fixed-base formula or of chaining
consecutive yearly links is entirely independent of the
procedure of measuring the monthly changes.

(d) One important operational result is inherent in this
procedure, namely that the year index, upon whatever base,
emerges as an end result from the calculation of the twelve
monthly indexes.
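
The procedure and properties (2) and (3) can be illustrated with a minimal sketch of the fixed weight aggregative case. The two commodities, their prices and their weights are invented for the purpose; the function and variable names are likewise illustrative and not part of the original notation.

```python
from fractions import Fraction

# Hypothetical data: two commodities, twelve months (index 0 = January).
# A weight of 0 means the commodity is not in the list for that month,
# and its price is left as None because no quotation exists.
weights = {
    "bread": [10] * 12,
    "strawberries": [0, 0, 0, 0, 2, 6, 4, 0, 0, 0, 0, 0],
}
base_prices = {
    "bread": [5] * 12,
    "strawberries": [None] * 4 + [20, 15, 18] + [None] * 5,
}
given_prices = {
    "bread": [6] * 12,
    "strawberries": [None] * 4 + [22, 15, 27] + [None] * 5,
}


def cumulative_aggregates(prices, weights):
    """Return the twelve cumulative monthly aggregates of price times weight."""
    running, totals = 0, []
    for j in range(12):
        for name, q in weights.items():
            if q[j]:  # skip commodities absent from the market in month j
                running += prices[name][j] * q[j]
        totals.append(running)
    return totals


def yearly_aggregate(prices, weights):
    """Return the yearly aggregate built from average prices and annual weights."""
    total = Fraction(0)
    for name, q in weights.items():
        annual_weight = sum(q)
        value = sum(p * w for p, w in zip(prices[name], q) if w)
        total += Fraction(value, annual_weight) * annual_weight
    return total


base_cum = cumulative_aggregates(base_prices, weights)
given_cum = cumulative_aggregates(given_prices, weights)

# Cumulative monthly indexes: given-year aggregate over base-year aggregate.
monthly_index = [Fraction(a_i, a_0) for a_i, a_0 in zip(given_cum, base_cum)]

# Yearly index computed independently from average prices and annual weights.
yearly_index = yearly_aggregate(given_prices, weights) / yearly_aggregate(base_prices, weights)

# Property (3): July on the June base, by direct division of the two indexes.
july_on_june = monthly_index[6] / monthly_index[5]
```

In this example the December cumulative index and the independently computed yearly index are both 962/802, as property (2) requires, and no price is ever carried forward into a month in which strawberries are off the market.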


It is important not to lose sight of the significance of the statements 
(1) and (2) above. The basic index is a yearly index and as a price or 
quantity index is of the same sort as those about which books and 
pamphlets have been written in quantity over the years. The within- 
the-year indexes partake of a slightly different character, in that they 
measure the cumulating influences of price and quantity changes over 
the seasons. But they do this with complete realism in the sense of 
giving a measurement of the price change or quantity change that actu- 
ally took place. This statement is not intended to refer to the use of 
approximate for exact weights (the problem of correct weights being a 
separate problem in index number theory and practice) nor to errors in 
price quotations (the collection and compilation of prices being a 
second separate problem in index number theory and practice). It is a 
proposal for replacement of a practice, which can never be anything but 
incorrect, of using annual weights, say, for February in place of February
weights, and of using, say, November prices for February when
no price ever existed for February. To introduce these non-appropriate 
and non-existent data into an index number in order to satisfy a public 
demand for monthly indexes is to create something out of nothing; and 
that it introduces a spurious element into index-number construction 
can scarcely be denied. 

The annual index here proposed is what an index has always been, 
namely a measure of price change upon a basis selected by the maker 
and measuring price change with an accuracy which is dependent on 
(1) the accuracy of the basic data, (2) the accuracy of the sampling 
involved, and (3) the efficiency of the formula used. The monthly 
indexes, subject to all the elements of accuracy or of error enumerated 
above, can give an accurate measure of the cumulative influences of 
price change (or quantity change) throughout the months of the year, 
compared to the corresponding months of the year chosen as base; and 
this is done with the complete realism that is associated with the disap- 
pearance of some commodities at some seasons and their reappearance 
at others. This sort of measurement has never yet been made, or at 
least the present author has never seen a description of it in the litera- 
ture, but the price measurement that is here under consideration, it is 
submitted, is just what is wanted for a realistic settlement of issues that 
arise today between the proponents of various sides of modern con- 
troversies over price changes and over changes in real production. 


























AN APPLICATION OF MARKOV PROCESSES TO THE 
STUDY OF THE EPIDEMIOLOGY OF 
MENTAL DISEASE 


ANDREW W. MARSHALL AND HERBERT GOLDHAMER 
The Rand Corporation 


I. INTRODUCTION 


WE SHALL present in this paper some methods that we have been
applying to studies of mental disease. These studies are epidemiological
in character, that is, they deal with the distribution and frequency
of mental disease in different population groups. We are primarily
interested in getting answers to several substantive problems.
In this paper, however, our concern is methodological and the emphasis 
is on the applied mathematical statistics and model construction of 
simple Markov processes. The models that we shall discuss are very 
simple ones and in their present form are intended only to provide 
rough estimates of certain epidemiological parameters that are not di- 
rectly observable or that can be secured only by expensive and time- 
consuming field surveys. 


II. OBJECTIVES 


The work that we shall report on has grown out of an interest in 
characterizing the age distributions of the mentally-ill population. 
These constitute the basic data of most epidemiological analyses. Age 
distributions are likely, however, to have a different significance ac- 
cording to the point in the pathological process used to represent the 
age structure of the population. Thus patients may be distributed by 
the age at which the disease had its onset, by the age at which they 
sought medical care or were institutionalized or by age at death, to 
mention only a few possibilities. The age distributions available in the 
literature of mental disease are mostly based on the age at which pa- 
tients are admitted to a mental hospital. This is due, of course, to the 
ready availability of such data in hospital records and to the difficulty 
of specifying age of onset in the case of mental diseases whose early de- 
velopment is often insidious and in any case not usually investigated or 
recorded with care. The widespread use of admission age has led in 
some cases to interpretations that are misleading. Thus one writer 
using hospital admission data argues that involutional psychoses in 
women are more closely associated with social than biological factors 
since their greatest incidence occurs after the involutional period. This
writer has developed a theory about the special stresses of life for
women in the age period of about 50-55. However, when age-at-onset
data are used, the peak incidence for the involutional psychoses among
women is approximately 45 years.

There are a number of hypotheses and clues in the study of mental 
disease that are closely linked to the age characteristics of the disease 
process. For this reason we have thought it worth while to attempt to 
substitute for the prevailing age-at-admission distributions a series of 
distributions based on age-at-onset. We found that the mental hospital 
systems of Ohio and Illinois record (but have not tabulated) for each 
admitted patient an age-at-onset of the disease.1 We recognize the 
deficiencies of these data and the numerous sources of inaccuracy by 
which they are affected. We believe that the data are subject to a 
systematic bias that tends to displace the true date of onset forward 
toward the time of admission.2 If this is so, these age-at-onset distributions
permit us to re-analyze the age structure of the patient population
in terms of distributions that have at least been shifted in the right 
direction and represent the least amount of correction that is required. 

The availability of age-at-onset data in Illinois and Ohio provided 
the possibility of attacking at the same time a closely related problem 
which concerned us. It is this problem and the method of dealing with 
it that we now wish to discuss. 

Important theoretical and practical considerations hinge on adequate 
answers to the question: What is the age-specific incidence of mental 
disease? This question has been answered primarily in terms of hospital 
first admission data, but rates obtained in this manner are significant 
for the incidence problem only if we can assume that the number of 
psychotics never hospitalized is negligible and that the lag between 
onset and hospitalization is very short. Neither of these assumptions is 
obviously justified a priori; in any case the legitimacy of using hospital 
admission rates could only be established by comparing them with the 
true incidence rates. 

The most obvious means of securing a true incidence rate is by direct 
field investigation of a population. Field investigations must usually 
confine themselves to rather limited population groups and for this 
reason it is generally easier in these cases to secure a prevalence than an 
incidence measure. (By a prevalence measure we mean the proportion 





1 We wish to acknowledge our indebtedness and express our gratitude to the Ohio and Illinois 
Departments of Public Welfare for permitting us to reproduce their punch cards on which onset data 
are recorded. 

2 This presumption is supported by data provided by Larsson and Sjögren in a study of a Swedish
population which has just come to hand. Cf. reference [5] and footnote 4.






















of a population that is affected at a given time; by an incidence meas- 
ure, the proportion that has become affected for the first time in a given 
period, usually one year.) Prevalence measures can, however, be con- 
verted to incidence measures, although this will usually lead to some 
loss of accuracy in the latter. But whether one aims at prevalence or 
incidence measures, field survey procedures are costly and time-con- 
suming. More important, it is extremely difficult to ensure that all non- 
institutionalized cases will be uncovered. Nonetheless, field surveys 
have been made with varying success. European, especially Scandi- 
navian, investigators have contributed most to this work. The latest in 
a series of excellent studies are by Fremming [2], and by Larsson and 
Sjögren [5]. In the United States we had for some time only two field
surveys of value [4, 9], although others are now underway [3]. 

Our own approach to this problem has been quite different, although 
its adequacy will undoubtedly require testing in the light of rates es- 
tablished by direct field investigation. We have devised several very 
simple models of the process (or perhaps more accurately, the stages 
or states) involved in the passage from sanity to insanity, hospitaliza- 
tion and death. Using the assumptions involved in one of the models 
and the data that the model requires, we have attempted to estimate 
what the total incidence rate must have been in order for it to have 
generated the known hospital admission rates. As we shall see more 
fully later, if we know something about the age-specific death rates of 
non-institutionalized psychotics, the death rates of the normal popula- 
tion, hospital admission rates and the lapse of time occurring between 
onset of the disease and hospitalization for individual patients, we can 
very likely infer what the total incidence in the population must have 
been. 

Deficiencies in the age-of-onset data and the absence of reliable data 
on the death rates of non-institutionalized psychotics would make it 
difficult at this stage to arrive at more than rough order of magnitude 
estimates of the total incidence rate even assuming the adequacy of the 
models that we are using. Nonetheless, we feel that the development of 
such models is important so that we can take advantage of more re- 
liable data when they become available and in order to stimulate a 
greater interest among epidemiologists in securing such data. 

Our interest in model construction is not confined to the immediate 
aim of securing estimates of the total incidence rate. These models, 
simple as they may be, provide a partial picture of the underlying proc- 
ess that generates a given incidence rate. They thus enable us to see 
more clearly the role of several factors in producing these rates and to
establish the relative sensitivity of the rates to these factors. In this
respect the models are more illuminating than an incidence measure 
secured by direct field investigation. To be sure, the models that we 
discuss here may not be of very great interest since their parameters do 
not represent factors that are usually thought of as causes of insanity. 
The models do, however, enable us to see how the total incidence rate 
is related to such parameters as death rates and the lag between onset 
and admission. The models also permit us to deal with certain questions 
in a more unified manner. Thus incidence and prevalence measures are 
sometimes treated as rather distinct measures whose relationships are 
not clearly defined. In the model framework the incidence measure is 
clearly seen as specifying the frequency with which certain transitions 
occur between selected states of the model, and prevalency as specify- 
ing the number of persons who at any one time are in certain selected 
states of the model. 
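
The distinction between a flow of transitions and a stock of persons in given states can be illustrated with a small numerical sketch. The five-state chain and every probability in it are hypothetical values chosen only for illustration; they are not estimates from the data. Incidence is computed here as the expected share of the original cohort passing from the sane state into an insane state during one step, and prevalence as the expected share of the original cohort that is insane and unhospitalized at the end of the step.

```python
# States: 0 sane, 1 insane (mild), 2 insane (severe), 3 hospitalized, 4 dead.
# All probabilities are hypothetical per-step values, not estimates.
P = [
    [0.990, 0.003, 0.002, 0.000, 0.005],
    [0.000, 0.880, 0.000, 0.100, 0.020],
    [0.000, 0.000, 0.480, 0.500, 0.020],
    [0.000, 0.000, 0.000, 1.000, 0.000],
    [0.000, 0.000, 0.000, 0.000, 1.000],
]


def step(dist, matrix):
    """Advance a distribution over the states by one time step."""
    size = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(size)) for j in range(size)]


def incidence_and_prevalence(matrix, steps):
    """Return (step, incidence, prevalence) for a cohort that starts sane."""
    dist = [1.0, 0.0, 0.0, 0.0, 0.0]
    history = []
    for t in range(1, steps + 1):
        # Flow: expected transitions from sane into either insane state.
        incidence = dist[0] * (matrix[0][1] + matrix[0][2])
        dist = step(dist, matrix)
        # Stock: expected share insane and unhospitalized after the step.
        prevalence = dist[1] + dist[2]
        history.append((t, incidence, prevalence))
    return history


history = incidence_and_prevalence(P, 10)
for t, incidence, prevalence in history:
    print(f"step {t:2d}  incidence {incidence:.5f}  prevalence {prevalence:.5f}")
```

In this toy chain the incidence falls slowly as the sane group is depleted, while the prevalence accumulates cases that have not yet been admitted or died. Both are expressed relative to the original cohort, which is a simplification; a prevalence rate in the usual sense would be taken relative to the living population.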
We now turn to an examination of the models themselves. 


III. THREE MODELS OF THE ONSET-ADMISSION PROCESS


The three models we will present are very simple. The stochastic 
process used is Markov and only a few states are assumed. The models 
treat only the process leading up to first admission to a mental hospital 
and therefore do not consider discharge and readmission. This is, of 
course, a usual focus for studies of mental disease. The models are 
similar to those suggested by Neyman in connection with the post discharge
histories of tuberculosis patients [6].


Model I: 


Our first model of the onset-admission process, Model I, makes use 
of the following states: 


$S_0(t)$ = alive, sane

$S_1(t)$ = alive, insane (mild), unhospitalized

$S_2(t)$ = alive, insane (severe), unhospitalized

$S_3(t)$ = insane, hospitalized

$S_4(t)$ = dead, outside of mental institution (hospital)


State $S_4(t)$ refers only to deaths outside of institutions since after
admission to a hospital subsequent changes of status are not of interest
to the model. Similarly, we can assume that persons entering hospitals
never leave, without altering the adequacy of our model for the analysis
of the onset-admission process for first admission cases. Thus death and
hospitalization are considered to be absorbing states, i.e., states from 
which there is no return. It is easily seen that this is perfectly proper
from the point of view of the process we wish to study since no one can
be a first admission for a second time, just as he cannot die twice. In 
this way sub-processes can be factored out for study from larger proc- 
esses and made complete by suitable conventions concerning certain of 
the states and transition probabilities. 

If we consider time intervals short enough so that only one transition 
to a different state can occur within it, the process of Model I may be 
diagrammed as follows: 





[Diagram: the states $S_0$, $S_1$, $S_2$, $S_3$, $S_4$ listed in columns at $t=0$, $t=1$, $t=2$, $t=3$, with arrows between successive columns.]


The arrows indicate the possible changes of state that may be accomplished
in one step. In addition, we may characterize our model by a
matrix of transition probabilities $p_{ij}$, where $p_{ij}$ is the probability of
going from $S_i$ to $S_j$ in the small basic time interval we have chosen.
Thus, for Model I the matrix $P$ of transition probabilities is

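$$
P = \begin{pmatrix}
p_{00} & p_{01} & p_{02} & 0 & p_{04} \\
0 & p_{11} & 0 & p_{13} & p_{14} \\
0 & 0 & p_{22} & p_{23} & p_{24} \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
$$

where $\sum_j p_{ij} = 1$ for all $i$; i.e., the sum of the probabilities in any row
must equal one.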
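
A minimal sketch of how such a matrix might be assembled and checked is given below. The function names and the numerical values are hypothetical and serve only as an illustration; the diagonal entries are obtained from the row-sum condition, and the absorbing states are identified as those whose diagonal entry equals one.

```python
def model_one_matrix(p01, p02, p04, p13, p14, p23, p24):
    """Build the Model I transition matrix from its off-diagonal entries.

    The diagonal entries follow from the requirement that each row sums to one.
    """
    matrix = [
        [1.0 - p01 - p02 - p04, p01, p02, 0.0, p04],
        [0.0, 1.0 - p13 - p14, 0.0, p13, p14],
        [0.0, 0.0, 1.0 - p23 - p24, p23, p24],
        [0.0, 0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 1.0],
    ]
    for row in matrix:
        if any(p < 0.0 for p in row) or abs(sum(row) - 1.0) > 1e-12:
            raise ValueError("each row must be a probability distribution")
    return matrix


def absorbing_states(matrix):
    """Return the indices of states that are never left once entered."""
    return [i for i, row in enumerate(matrix) if row[i] == 1.0]


# Hypothetical per-step probabilities, for illustration only.
P_example = model_one_matrix(0.003, 0.002, 0.005, 0.100, 0.020, 0.500, 0.020)
print(absorbing_states(P_example))
```

With these inputs the absorbing states reported are 3 and 4, hospitalization and death, in agreement with the convention adopted above.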

We may differentiate two levels of assumptions required to justify 
the application of models of this general type to the currently available 
onset data. These data, for each individual admitted to the hospital, 
consist of: (1) The age of admission, (2) The length of time between 
first onset of the disease and admission in units of one year, and (3) the 
calendar year of admission.3 Assumptions of the first kind concern the
number of states in the model to approximate the real world and the 
possibility or impossibility of reaching one state from another directly 





3 The data we have contain more information (finer time units) about time lags of less than one
year between onset and admission and also give month, day, and year of admission. For various
reasons, however, we are unable to make use of this additional information.

in one step. In Model I for example this kind of assumption has to do
with the inclusion of two states of insanity and assumptions that no 
one ever regains sanity once insane, no one becomes a severe case if he 
at any time is a mild case or vice versa. It is also assumed that the popu- 
lation does not change in any other way, for example, by immigration. 
These types of assumptions determine the size of the matrix and the 
number and position of the zeros in the matrix. The second kind of 
assumption has to do with the structure of the model itself, principally 
with the assumption of constant transition probabilities over time. In 
practice this must refer to both calendar and age time for we will have 
to group together the data for several years and several age groups. As 
is usual in empirical work we will have to be content to claim the model 
as a first approximation. As will be indicated later these claims can be 
checked to some extent by statistical means. 

In the analysis of Model I one of the first problems is to evaluate $P^n$,
the $n$th power of the matrix of transition probabilities. In addition to
the evaluation of $P^n$, which would be sufficient to solve the usual problem
involved in the analysis of models of this sort, we must notice that
our problem is rather special. This can be seen as follows: Let us suppose
we have a cohort of persons of age $x$. We are interested in the rate
(ultimately the probability of going insane at age $x$) at which the individuals
in this group make the transition from $S_0$ to $S_1$ or $S_2$ during the
time period of one year, or during age $x$ to $x+1$. Our data consist entirely
of persons who enter $S_3$, some perhaps as late as $x+2$, $x+3$, etc.
Thus, for instance, we have to evaluate the probability of being in $S_1$
or $S_2$ at the beginning of age $x+1$ and arriving in the hospital at age
$x+1$, $x+2$, $x+3$, etc. For this purpose we break the process into two
parts and define
$$
P = \begin{pmatrix}
p_{00} & p_{01} & p_{02} & 0 & p_{04} \\
0 & p_{11} & 0 & p_{13} & p_{14} \\
0 & 0 & p_{22} & p_{23} & p_{24} \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
$$

to characterize the process during age $x$ and

$$
P_{00} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & p_{11} & 0 & p_{13} & p_{14} \\
0 & 0 & p_{22} & p_{23} & p_{24} \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
$$

to describe the process from age $x+1$ on. Here we will not go through
the algebra involved in evaluating $P^n$, $P^n P_{00}^{n}$, $P^n P_{00}^{2n}$, $\ldots$, etc.
Our procedure has been to evaluate the above expressions for the
time-discrete process and then to replace this by the continuous process
which is its limit as the basic time intervals go to zero. The final expressions
are then written in terms of a time parameter $t$ and $\lambda_{01}$, $\lambda_{02}$,
$\lambda_{04}$, $\lambda_{13}$, $\lambda_{14}$, $\lambda_{23}$, $\lambda_{24}$, instantaneous rates of change between states $S_i$ and $S_j$.
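
The two-part process can be sketched numerically for the time-discrete case. The code below is only an illustration under stated assumptions: the basic interval is taken to be one month ($n = 12$ steps per year), the monthly probabilities are hypothetical, and the matrix for the later years is taken to differ from $P$ only in that the sane state is frozen, so that no onsets are counted after the first year.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def matpow(matrix, power):
    """Raise a square matrix to a non-negative integer power."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(power):
        result = matmul(result, matrix)
    return result


# Hypothetical monthly transition probabilities, for illustration only.
P = [
    [0.9995, 0.0001, 0.0001, 0.000, 0.0003],
    [0.0, 0.985, 0.0, 0.013, 0.002],
    [0.0, 0.0, 0.798, 0.200, 0.002],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]


def yearly_admissions(matrix, years, steps_per_year):
    """Probability of first admission in each year after age x,
    counting only onsets that occur during the first year."""
    frozen = [row[:] for row in matrix]
    frozen[0] = [1.0, 0.0, 0.0, 0.0, 0.0]  # assumption: no onsets after year one
    first_year = matpow(matrix, steps_per_year)
    later_year = matpow(frozen, steps_per_year)
    current = first_year
    cumulative = []
    for _ in range(years):
        cumulative.append(current[0][3])  # started sane, hospitalized by year end
        current = matmul(current, later_year)
    return [cumulative[0]] + [cumulative[k] - cumulative[k - 1]
                              for k in range(1, years)]


admissions = yearly_admissions(P, 7, 12)
for year, value in enumerate(admissions, start=1):
    print(f"year {year}: {value:.6f}")
```

From the second year on the yearly admission probabilities decline, since the pool of unhospitalized cases is no longer replenished by new onsets.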


Model II: 


As has been seen above, Model I may be designated as a two in- 
sanity state model; similarly Model II may be designated as a one in- 
sanity state model with recovery and relapse. In this model the possible 
states are: 


$S_0(t)$ = alive, sane

$S_1(t)$ = alive, recovered from insanity, never admitted to hospital

$S_2(t)$ = alive, insane, unhospitalized

$S_3(t)$ = insane, hospitalized

$S_4(t)$ = dead outside hospital.


As before, $S_3$ and $S_4$ are absorbing states since we are concerned only
with the process leading to first admissions. The inclusion of a separate
state $S_1$, recovered, gives us a flexibility that would not be available if
we had treated recovery as a transition from $S_2$ to $S_0$. This simplification
can be introduced later but for the present we will allow for the
possibility that $p_{02}$, the probability of first onset, may not be equal to
$p_{12}$, the probability of relapse. Again as in Model I we define the matrices
of transition probabilities.








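$$
P = \begin{pmatrix}
p_{00} & 0 & p_{02} & 0 & p_{04} \\
0 & p_{11} & p_{12} & 0 & p_{14} \\
0 & p_{21} & p_{22} & p_{23} & p_{24} \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
$$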


Model III:


Model III is a modification of Model I, having two insanity states 
with a non-reversible transition from the mild to the severe state. The 
possible states are as in Model I and the matrices of transition probabilities
are

$$
P = \begin{pmatrix}
p_{00} & p_{01} & 0 & 0 & p_{04} \\
0 & p_{11} & p_{12} & p_{13} & p_{14} \\
0 & 0 & p_{22} & p_{23} & p_{24} \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
$$


Thus no direct transition from sanity, $S_0$, to a severe state of insanity,
$S_2$, is possible. A mild case of insanity may become severe, but not the
reverse. Death, $S_4$, and hospitalization, $S_3$, are, as before, absorbing
states.

As can be seen, all of the above models have the same dimensionality; 
that is, they all have the same number of possible transitions. In these 
terms they are models of equivalent complexity although just which 
transitions are allowed is very important in determining the complexity 
of the possible individual case histories. Some exploratory work indi- 
cates that models of at least this dimensionality are required to fit the 
data we have. A model of the same basic type as Model I but having 
only one state of insanity has been tried. This model which we may 
designate as Model I′ is the same as Model I with states $S_1$ and $S_2$ combined.
When applied to our data, it does not appear capable of producing
a good fit. The data seem to require a model which will satisfy two
conditions: (1) a large proportion of the persons of a given age having 
an onset of mental disease are admitted to a mental institution within 
one year of the onset of the disease and (2) the frequency distribution 
of the elapsed time between onset and admission has a long tail. These 
conditions obviously suggest trying something like Model I with its 
two insane groups, one, $S_1$, with long time lags between onset and admission,
and the other, $S_2$, with short time lags between onset and admission.
Model III also has the possibility of reproducing the long tail
on the elapsed time distribution. Without some changes Model III does 
not appear to be as adequate in this regard. 
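
The remark on dimensionality can be checked by listing the transitions each model allows. The sketch below is an illustration only; the edge lists are read off the zero patterns of the three matrices, and the list for Model III assumes that a direct transition from the mild state to the hospital is permitted. The second function counts the distinct sequences of states (ignoring time spent in each state) by which a sane person can reach an absorbing state, with an arbitrary cap on the number of transitions because Model II allows repeated recovery and relapse.

```python
# Allowed transitions (from, to) between distinct states, as read from the
# zero patterns of the matrices; the Model III list is an assumption.
MODELS = {
    "I": {(0, 1), (0, 2), (0, 4), (1, 3), (1, 4), (2, 3), (2, 4)},
    "II": {(0, 2), (0, 4), (1, 2), (1, 4), (2, 1), (2, 3), (2, 4)},
    "III": {(0, 1), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3), (2, 4)},
}
ABSORBING = {3, 4}


def count_histories(edges, max_moves, state=0):
    """Count state sequences from `state` to an absorbing state that use
    at most `max_moves` transitions."""
    if state in ABSORBING:
        return 1
    if max_moves == 0:
        return 0
    return sum(count_histories(edges, max_moves - 1, target)
               for source, target in edges if source == state)


for name, edges in MODELS.items():
    print(name, len(edges), count_histories(edges, max_moves=6))
```

Each model has seven allowed transitions. With the cap of six transitions used here, Models I and III each admit five distinct case histories, while Model II admits nine and would admit more under a larger cap.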








IV. PROBLEMS IN THE APPLICATION OF THE MODELS 


We turn now to a discussion of the problems of using these models 
for the accomplishment of our objectives: (1) Deriving age-specific 
rates of onset of mental disease from admission data, and (2) Construct- 
ing prevalency estimates from admission data. First, it is obvious that 
at best, even if we ignore the clear inadequacies of currently available 
data, only approximate results are possible. A really adequate stochastic
process would, in addition to having a generally higher level of detail,
have transition probabilities which were, even locally, functions of
age. With our constant transition probability models the best we can 
hope to do is locally to approximate the true process in five year age 
groups. We will return to this point. Second, a characteristic of our 
data, which is decisive with respect to certain details in the estimation 
of unknown parameters in the models, is what may be called its cross- 
sectional character. Rather than having the data pertinent to one co- 
hort followed over several time periods, we have data on several cohorts 
at one time period. This point can be made more clear as follows: Let 
$N(t)$ be a vector describing the state of a cohort of $N$ individuals at
time $t$. In the case of our models the vector would have 5 elements.
Suppose we observed the cohort at times $t_0, t_1, t_2, \ldots$; then our estimation
procedures would be based upon the conditional probabilities
$P[N(t_{i+1}) \mid N(t_i)]$. This is the usual form that estimation problems
take in Markov processes. In contrast we have observations on a series
of cohorts $N_1(t), N_2(t), \ldots$ at time $t_0$, say.

As an example of how the estimation of the unknown parameters in 
one of our models might be treated, we will sketch out the estimation 
procedure for Model I in a typical case. Let us assume that adequate 
values are available for $\lambda_{04}$, $\lambda_{14}$, and $\lambda_{24}$. This is true for $\lambda_{04}$, which is
essentially the death rate $q_x$ in a suitable life table, and studies of the
death rates of hospitalized patients shed some light on the possible
values of $\lambda_{14}$ and $\lambda_{24}$. In any case the latter can always be treated as
parameters of variation in the study and computations performed for
bracketing values. Thus we are left with four parameters, $\lambda_{01}$, $\lambda_{02}$, $\lambda_{13}$,
and $\lambda_{23}$, to estimate. We will require then at least four equations relating
these parameters to the data for this purpose. In the particular case at 
hand we used exactly four equations in order to simplify the computa- 
tion. A single iterative method of solution was used starting with initial 
values obtained from nearby age groups. We also arranged it so that we 
have some observations left over in order that we may have an inde- 
pendent estimate of the goodness of fit of our estimated model to the 
whole of the data. This is essentially an estimation procedure based 
upon the method of moments and yields consistent estimates of the 
parameters. An alternative procedure would be to use Neyman’s minimum
$\chi^2$ methods of estimation [7], using five or six equations, but for
equations of the sort implied by our models this method would require 
considerably more computational effort. 

Denote by $P_{03}(t)$ the probability of an individual being in $S_3$ by time
$t$ if he began in $S_0$ at time zero, where time zero is defined relative to
age $x$. Writing out the probabilities of an individual arriving in $S_3$ for
the first time during the yearly time periods (0, 1), (1, 3), (3, 5), and
(5, 7) as functions of $\lambda_{01}$, $\lambda_{02}$, $\lambda_{13}$, and $\lambda_{23}$, we have [for the exact
expressions see appendix]

$$
\begin{aligned}
f_1(\lambda_{01}, \lambda_{02}, \lambda_{13}, \lambda_{23}) &= \{P_{03}(t=1)\} = P_1 \\
f_2(\cdots) &= \{P_{03}(t=3) - P_{03}(t=1)\} = P_2 \\
f_3(\cdots) &= \{P_{03}(t=5) - P_{03}(t=3)\} = P_3 \\
f_4(\cdots) &= \{P_{03}(t=7) - P_{03}(t=5)\} = P_4
\end{aligned}
$$


$P_1$ is the probability of being admitted to a hospital within one year of
onset, $P_2$ is the probability of being admitted more than one year but
within three years of onset, etc. All of these computations are specific
to one age or age group; in our work we have used five year age groups.
All of the parameters are different for each age group; the death rates
are our best guess based upon the information at our disposal, and the
other parameters are then estimated from the data. This means that
we end up with a model with fixed parameters being fitted as a piecewise
approximation to the true underlying process, where the parameters
must be considered as continuous functions of age.
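
The four period probabilities can be sketched for the continuous process as follows. This is an illustration under stated assumptions and is not the expression of the appendix: waiting times are taken to be exponential with constant rates, onsets are counted only during the first year, and all rate values are hypothetical. Under these assumptions the probability of admission by time $t$ is the sum, over the two insane states, of the onset rate times the share of exits that are admissions times an elementary integral over the onset time.

```python
import math

# Hypothetical instantaneous rates per year, for illustration only.
RATES = {
    "lam01": 0.0004, "lam02": 0.0012,                # onset: mild, severe
    "lam13": 0.15, "lam23": 2.5,                     # admission: mild, severe
    "lam04": 0.004, "lam14": 0.02, "lam24": 0.02,    # death outside hospital
}


def p03(t, lam01, lam02, lam13, lam23, lam04, lam14, lam24):
    """Probability of admission by time t for a person sane at time zero,
    counting only onsets during the first year (assumed competing risks)."""
    leave = lam01 + lam02 + lam04        # total rate of leaving the sane state
    window = min(t, 1.0)                 # onsets counted on [0, window]
    total = 0.0
    for onset, admit, death in ((lam01, lam13, lam14), (lam02, lam23, lam24)):
        exit_rate = admit + death        # this closed form needs exit_rate != leave
        # Integral of exp(-leave * s) over the onset window.
        onset_mass = (1.0 - math.exp(-leave * window)) / leave
        # Integral of exp(-leave * s) * exp(-exit_rate * (t - s)) over the window.
        still_waiting = (math.exp(-exit_rate * t)
                         * (math.exp((exit_rate - leave) * window) - 1.0)
                         / (exit_rate - leave))
        total += onset * (admit / exit_rate) * (onset_mass - still_waiting)
    return total


def period_probabilities(rates, cuts=(0, 1, 3, 5, 7)):
    """Differences of p03 over the periods (0,1), (1,3), (3,5), (5,7)."""
    cumulative = [p03(t, **rates) for t in cuts]
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


periods = period_probabilities(RATES)
for label, value in zip(("(0,1)", "(1,3)", "(3,5)", "(5,7)"), periods):
    print(label, f"{value:.6f}")
```

Estimation then amounts to choosing the four unknown rates so that these four quantities match their observed counterparts for the age group in question.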

Thus from the Ohio data, for males having onsets between the ages 
of 35 and 39, for example, we have estimates of the probabilities 
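$P_1, \ldots, P_4$, denoted by $\hat{P}_1$, $\hat{P}_2$, $\hat{P}_3$, and $\hat{P}_4$. The problem then is to
solve the system of equations

$$
\begin{aligned}
f_1(\lambda_{01}, \lambda_{02}, \lambda_{13}, \lambda_{23}) &= \hat{P}_1 \\
f_2(\cdots) &= \hat{P}_2 \\
f_3(\cdots) &= \hat{P}_3 \\
f_4(\cdots) &= \hat{P}_4
\end{aligned}
$$

for $\lambda_{01}$, $\lambda_{02}$, $\lambda_{13}$, and $\lambda_{23}$. A simple iterative procedure for doing this was
applied to data for both sexes and age groups from 15 to 85 years of age,
using Model I. Thus far, no numerical work has been undertaken with
Models II and III.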

Using the estimated parameters one can estimate the average prob- 
ability of having an onset of mental disease in any one year between the 
ages of, for example, 35 to 39, the assumption being made that the 
underlying process is the same for all persons in this age group. For 
Model I this estimate is 


$$
\frac{(\lambda_{01} + \lambda_{02})\left(1 - e^{-\lambda_{01} - \lambda_{02} - \lambda_{04}}\right)}{\lambda_{01} + \lambda_{02} + \lambda_{04}}
$$

Since Model I tends not to reproduce the long tail of the lag time distribution,
it therefore tends to under-estimate the proportion of persons
having onsets who are never admitted to hospitals. A better estimate of
the onset probability is very likely


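$$
\left[\frac{\text{est. } P_O}{\text{est. } P_{03}(\infty)}\right]
\left[\frac{\sum \text{hospitalized onsets with onset at age 35-39}}{\text{Total population of age 35-39}}\right]
$$


where est. $P_{03}(\infty)$ is obtained by substituting the values of the estimated
parameters into the expression for $P_{03}(\infty)$. Thus


$$
\left[\frac{\text{est. } P_O}{\text{est. } P_{03}(\infty)}\right]
$$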


can be looked upon as a correction factor to the first approximate 
estimate of the age specific onset rate (probability) obtained from dividing
the total number of onsets of age $x$ reaching a hospital by the total
population of age $x$. Really one should not divide by the total population
but include only those persons never admitted to a mental hospital,
but this is a small correction except for the older age groups in the 
population. 
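
A rough numerical sketch of the onset probability and of the correction factor follows. It is an illustration only. The one-year onset probability uses the Model I expression given above; the probability of eventual admission is derived here under the added assumptions of exponential waiting times and of onsets counted during a single year, and is not taken from the appendix. All rate values are hypothetical.

```python
import math


def one_year_onset_probability(lam01, lam02, lam04):
    """Probability that a sane person has an onset within one year."""
    leave = lam01 + lam02 + lam04
    return (lam01 + lam02) * (1.0 - math.exp(-leave)) / leave


def eventual_admission_probability(lam01, lam02, lam04, lam13, lam14, lam23, lam24):
    """Probability of an onset within the year that ends in admission
    rather than in death outside the hospital (assumed competing risks)."""
    leave = lam01 + lam02 + lam04
    reach_hospital = (lam01 * lam13 / (lam13 + lam14)
                      + lam02 * lam23 / (lam23 + lam24))
    return reach_hospital * (1.0 - math.exp(-leave)) / leave


def correction_factor(lam01, lam02, lam04, lam13, lam14, lam23, lam24):
    """Ratio of the onset probability to the eventual admission probability."""
    return (one_year_onset_probability(lam01, lam02, lam04)
            / eventual_admission_probability(lam01, lam02, lam04,
                                             lam13, lam14, lam23, lam24))


# Hypothetical rates: low death rates for a younger group, higher for an older one.
younger = dict(lam01=0.0004, lam02=0.0012, lam04=0.004,
               lam13=0.15, lam14=0.02, lam23=2.5, lam24=0.02)
older = dict(lam01=0.0004, lam02=0.0012, lam04=0.05,
             lam13=0.15, lam14=0.15, lam23=2.5, lam24=0.15)

print(f"{one_year_onset_probability(0.0010, 0.0015, 0.004):.6f}")
print(f"{correction_factor(**younger):.3f}  {correction_factor(**older):.3f}")
```

With these illustrative rates the factor stays close to one when death rates are low relative to admission rates and rises appreciably when they are not, which is the qualitative pattern described in the text.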

The results of the next section of the paper indicate that the correction
factor

$$
\left[\frac{\text{est. } P_O}{\text{est. } P_{03}(\infty)}\right]
$$


is not very large, especially for the age groups less than 60 years of age. 
The reason is, of course, that the death rates are so low relative to the 
rates at which the insane enter hospitals that in only a few cases does
death terminate life before admission can take place. For the age groups over
60, however, the correction seems to run from 10 to 40%, i.e., the factor 
ranges from 1.10 to 1.40. This result that the great majority of the 
insane eventually are hospitalized is not inconsistent with the results of 
prevalency studies which indicate that the number of insane found out- 
side of institutions with no previous record of hospitalization is very 
large. Some rough calculations presented in the next section of the 
paper will show that there is no conflict in the case of our results. In 
any case it is clear that since the elapsed time between onset and ad- 
mission is rather long in many cases, no contradiction in the results need 
exist. The prevalency consequences of our models should therefore be 
of considerable interest in themselves as well as forming the basis of an 
independent check on the general adequacy of the approximation the 
models offer of the onset-admission problems.


V. SOME RESULTS OF CALCULATIONS USING MODEL I
A. Introduction 


In this section we wish to present in more detail some of the results 
obtained by calculations using Model I and our onset-admission data 
for Ohio state mental institutions for the years 1947-1948. Before turn- 
ing to this, some comments about the spirit in which these calculations 
were made and the use of gross epidemiological models may serve to 
clarify our feelings about these calculations. 

In general it may be said that we are in favor of a lot of rough calculations,
provided these imply some underlying model of the situation,
in order to obtain from available data other numbers of interest. Our 
current work is, we believe, an example of this type of thing. Psychi- 
atric epidemiology requires, for most purposes, total incidence rates 
based on age of onset rather than hospital admission rates which are 
deficient both from the standpoint of incidence and age distribution. 
We think that in the long run it is more important to attempt rough 
estimates of the rates in which we are really interested than to continue 
gathering quite exact data on admission rates. Proper caution must 
always be observed and something will be said of this below. Our feeling 
is that even models of the level of aggregation and grossness described 
earlier in this paper can produce interesting substantive results and at 
the same time serve to clarify the nature of the epidemiologic process. 

We do not, of course, believe that any model is as good as any other 
but we are quite sure that it will be some time before really good and 
also manageable models of the processes we are interested in will be 
produced. From a certain practical point of view the reason one wishes 
to have a good model is that, more or less by definition, a good model is 
one which for one’s particular purpose produces better—that is, more 
accurate—answers when used in calculations. On this view good models 
are those which allow one to predict, with sufficient accuracy, interest- 
ing numbers. There is another view which emphasizes or defines the 
goodness of models in terms of their power to unify lower order models, 
or in other words to combine in a nice way the fine structure of the 
process. Both functions of models are important; in this paper we are 
concerned primarily with the first. 

Our attitude toward the calculations presented here is that they 
must be regarded as quite provisional. Indeed, in the case of this particu- 
lar problem, where several alternative models with no less a priori 
probability of giving a good fit to the data come easily to mind, our 
feeling is that until much further investigation of their predictive power
is undertaken, no results should be accepted which are not model independent.
In other words, unless several or perhaps all reasonable
models give approximately the same answer the result is not to be 
accepted. 

We turn now to the discussion of the results using Model I. 


B. Onset-Admission Data, Ohio, 1947-1948 

The data which have made our studies possible have the character- 
istic that in addition to the usual age of admission information the 
duration of time between the initial onset of the disease and admission 
is also given. Two states now record such information: Illinois and 
Ohio. We have primarily used the Ohio data and only the work done 
with them is reported here. There are some, not too easily described,
reasons for believing that they are slightly more accurate than the
Illinois data, but we are not convinced that this is the case.


TABLE I

OHIO, STATE MENTAL PATIENTS, MALE, AGE OF ONSET OF
FIRST ADMISSION CASES DISTRIBUTED BY TIME
LAG BETWEEN ONSET AND ADMISSION*

          Time Lag Between Onset and Admission in Years

Age of
Onset      0-1    1-3    3-5    5-7   7-11   over 11   Total
15-19       74     28     17      8      7        26     160
20-24       98     48     25     11     14        35     231
25-29      104     50     21     23     18        34     250
30-34      152     62     31     21     20        21     307
35-39      170     53     27     17     16        22     305
40-44      194     68     15     13     19        26     335
45-49      177     55     16     11     17        16     292
50-54      148     45     16     15     15        12     251
55-59      161     52     11     12     12         9     257
60-64      162     82     35     13      9         7     309
65-69      180     73     27     16     17         5     318
70-74      120     82     33     12     10         4     261
75-79      112     60     34     11      7         0     224
80-84       71     32     22      3      0         0     128
85-89       27     11      2      0      1         0      41
90-94        4      2      0      1      0         0       7
95-99        3      0      0      0      0         0       3
Totals    1957    803    332    187    182       217    3678

* Data presented here are for the years 1947-1948. The cases represented are first admissions with
psychosis; i.e., with cases having diagnoses without psychosis and primary character disorders omitted.
Only those cases for which information existed concerning the lag between onset and admission could
be used; these represent approximately 80-85% of the total first admissions, with psychosis, during
these two years.

Although the information on the date or age of onset is obtained by 
asking for the dates of first onset and of current onset, the data punched 
on our IBM cards is in terms of the elapsed time between onset (first 
and current) and admission. We have segregated for study the first 
admissions cases with psychosis, and concerned ourselves with the 
elapsed time between first onset and first admission. The elapsed time 
is coded in such a way that one may distinguish time intervals of: one 
month for elapsed times of less than one year and intervals of one year 
for elapsed times greater than one year. For our purposes we had to 
calculate a patient’s age at onset by using the elapsed time between his 
onset and admission dates and his age at admission. Due to the discrete
nature of the elapsed time code some mistakes will be made in estimating
age at onset for individual patients but the mistakes largely offset
one another as far as the whole group of patients is concerned. Tables
I and II give the basic data classified in the manner appropriate for
use in the models described earlier. Tables VIII and IX, below, show the
changes in age distribution of first admission patients when they are
grouped according to age of onset instead of the age of admission. It is
the function of the model calculations to refine these approximations
and to include estimated numbers of cases who are never hospitalized.


TABLE II

OHIO, STATE MENTAL PATIENTS, FEMALE, AGE OF ONSET
OF FIRST ADMISSION CASES DISTRIBUTED BY TIME
LAG BETWEEN ONSET AND ADMISSION*

            Time Lag Between Onset and Admission in Years
Age of
Onset       0-1    1-3    3-5    5-7   7-11  over 11   Total
15-19        89     30     11     17     14       20     181
20-24       177     47     15     20     20       24     303
25-29       249     78     41     24     26       30     444
30-34       204     81     36     22     28       32     403
35-39       188     58     31     16     29       18     340
40-44       164     64     28     26     18       24     324
45-49       147     52     30     20     27       21     297
50-54       139     68     28     24     13       21     293
55-59       133     56     24     11     11       12     247
60-64       110     62     21     11     17        7     228
65-69       103     64     31     16     16        1     231
70-74        83     70     34     21     16        3     227
75-79        76     55     28     13     12        1     185
80-84        54     32     11      6      4        0     107
85-89        18     13      4      0      1        0      36
90-94         5      2      0      0      0        0       7
Totals     1939    832    373    247    252      214    3857

* See footnote to Table I.
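
The classification step described in the text (age at onset taken as age at admission less the coded elapsed time, then cross-classified by onset age group and lag class) can be sketched as follows. This is only an illustration: the function names and the four patients are hypothetical, and treating each lag class as a half-open interval is an assumption, since the text does not say where a lag of exactly 1, 3, 5, 7 or 11 years is placed.

```python
from collections import Counter

# Lag classes of Tables I and II, in years. Half-open intervals
# [low, high) are an assumption of this sketch.
LAG_CLASSES = [(0, 1, "0-1"), (1, 3, "1-3"), (3, 5, "3-5"),
               (5, 7, "5-7"), (7, 11, "7-11")]


def age_at_onset(age_at_admission, elapsed_years):
    """Estimated age at first onset: age at admission less the coded lag."""
    return age_at_admission - elapsed_years


def onset_age_group(age, start=15, width=5):
    """Label of the five-year group (15-19, 20-24, ...) containing age."""
    if age < start:
        return None  # younger than the youngest tabulated group
    low = start + width * int((age - start) // width)
    return f"{low}-{low + width - 1}"


def lag_class(elapsed_years):
    """Label of the lag class containing elapsed_years."""
    for low, high, label in LAG_CLASSES:
        if low <= elapsed_years < high:
            return label
    return "over 11"


def cross_classify(patients):
    """Count patients by (onset age group, lag class)."""
    table = Counter()
    for age_adm, lag in patients:
        group = onset_age_group(age_at_onset(age_adm, lag))
        if group is not None:
            table[(group, lag_class(lag))] += 1
    return table


# Hypothetical patients: (age at admission, elapsed time in years).
patients = [(42, 6.5), (23, 0.25), (61, 12.0), (38, 2.0)]
counts = cross_classify(patients)
```

A patient admitted at 42 with a coded lag of 6.5 years is thus counted in onset group 35-39, lag class 5-7, rather than in admission group 40-44.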

There are at least two possible biases in the data which deserve men- 
tion since their existence would serve to qualify any of our results. By 
bias in the data we mean here an imperfection in the data such that 
when used in Model I biased estimates of the parameters will be ob- 
tained. Possible biases are: 

1. The older a patient is when admitted to a hospital the more likely 
it is that information of almost every kind about him will be unavail- 
able. This is often the case in data gathered upon admission to mental 
hospitals. If this were true with regard to age of onset information in 
the Ohio data the percentage of patients whose age of onset is unknown 
would be a monotonic increasing function of age of admission. The 
magnitude of the bias effect would depend upon the degree of the rela- 
tion between age and lack of knowledge. The bias in our estimates of 
$\lambda_{01}$, $\lambda_{02}$, $\lambda_{13}$, and $\lambda_{23}$ would come about for the following reason: There
will be a systematic under-representation of persons with long elapsed 
times between onset and admission. On the average, under-estimating 
the length of elapsed time leads to an under-estimate of the number of 
insane persons who are “lost” to the hospital system by virtue of 
intervening death. This in its turn would lead to an under-estimation of 
the amount by which admission rates must be corrected to secure total 
incidence age of onset rates. Fortunately we shall see that the Ohio data 
do not have this defect. 

In other cases this bias could be roughly corrected if necessary by 
distributing the admissions of any given age, whose age of onset is un- 
known, among the possible ages of onset in a manner proportional to 
those for which the ages of onset are known. This was not done in our
calculations; only the admissions with known age of onset were used 
because the foregoing type of bias was not present in the data. 


TABLE III

PERCENTAGE OF CASES WITH AGE OF ONSET UNKNOWN
BY AGE AT ADMISSION AND SEX

          10-19  20-29  30-39  40-49  50-59  60-69  70-79  80-89  90-99
Male        21     21     23     20     22     21     21     16     14
Female      20     16     16     19     19     16     19     20     14


TABLE IV

ESTIMATED AND ASSUMED VALUES OF THE PARAMETERS OF
MODEL I FOR OHIO, MALE ONSET DATA, 1947-1948

                              Parameters for Model I
Age of
onset   $\lambda_{01}$  $\lambda_{02}$  $\lambda_{13}$  $\lambda_{23}$  $\lambda_{04}$*  $\lambda_{14}$‡  $\lambda_{24}$†
15-19     .000106   .000114    .260    13.928    .00178    .00890    .01780
20-29     .000233   .000167    .212    12.630    .00245    .00868    .01735
30-34     .000315   .000283    .209    12.756    .00321    .00833    .01665
35-39     .000279   .000341    .219    16.228    .00438    .01140    .02280
40-44     .000236   .000438    .255     7.341    .00640    .01670    .03340
45-49     .000196   .000414    .205     7.039    .00961    .02263    .04525
50-54     .000313   .000330    .213    29.437    .01446    .03075    .06150
55-59                           Not computed
60-64     .000676   .000485    .394    33.008    .03117    .05500    .11000
65-69     .000937   .000753    .281    30.035    .04570    .07810    .15620
70-74     .001500   .000683    .324    11.125    .07384    .13475    .26950
75-79     .002490   .00119     .237    14.247    .10392    .16595    .33190
80-84     .002320   .00142     .379    28.805    .15238    .21100    .42200
85-99                           Not computed

* Death rates (sane) taken from life table, white males, 1939-41.

† It has been assumed that the severe cases have death rates as high outside the mental hospitals
as all patients do within the hospitals. This assumption does not seem to be too bad a priori and since
the severe cases appear to make the transition to the hospital so quickly, at least in our model, an error
in setting this rate would not have too much effect upon our calculations. Malzberg's relative mortality
corrections were used, except for the age group 15-19 which Malzberg did not treat. In this case it was
assumed that the relative mortality was 10 times the normal rate. B. Malzberg, “Life tables for patients
with mental disease,” Mental Hygiene, Vol. 16, 1932. These corrections are almost identical with those
which would have been derived from O. Oedegaard, “Mortality in Norwegian mental hospitals 1926-
1941,” Acta Genetica et Statistica Medica, Vol. II, 1951.

‡ It is assumed that mild cases have death rates one-half as high as the severe cases when outside
mental hospitals. This is a correction in the right direction but is essentially arbitrary for there is no
information available on this matter.


2. We conjecture that patients, relatives and therefore clinical records
tend to under-estimate the elapsed time between first onset and
admission. Insidiousness of onset and other difficulties of observation
probably lead to a postdating of the onset period. In addition people
seem to forget or be mistaken in their remembrances of the earliest
occurrences of onset symptoms. The very large number of cases reported
with elapsed time between first onset and admission of less than
one month probably supports our view.⁴ The effect of this is to bias the
estimates obtained from Model I in the direction of under-estimating
the corrected onset rates; i.e., estimates of $\lambda_{01}$ and $\lambda_{02}$ will be too small.

⁴ Swedish data published by Larsson and Sjögren [5] show that only 37 per cent of hospitalized
cases had a lag between onset and admission of one year or less, whereas 52 per cent of our cases fall
in this group. Eighteen per cent of the Swedish cases had a lag of 10 years or more, whereas only 6 per
cent of our cases show a lag over 11 years.

3. There is great uncertainty about the death rates of the insane
outside mental institutions, especially so with regard to the period
before first admission has taken place. The error in our rough guesses of
the death rates of the insane outside institutions is almost certainly
more important than the structural bias of the Model that we have
just mentioned.


TABLE V

ESTIMATED AND ASSUMED VALUES OF THE PARAMETERS OF
MODEL I FOR OHIO, FEMALE ONSET DATA, 1947-1948

                              Parameters of Model I
Age of
onset   $\lambda_{01}$  $\lambda_{02}$  $\lambda_{13}$  $\lambda_{23}$  $\lambda_{04}$*  $\lambda_{14}$†  $\lambda_{24}$‡
15-24                           Not computed
25-29     .000331   .000424    .232    15.568    .00201    .00775    .01550
30-34     .000336   .000377    .252    10.846    .00249    .00787    .01573
35-39     .000270   .000366    .265    17.494    .00323    .00980    .01960
40-44     .000450   .000349    .142    10.812    .00446    .01270    .02539
45-49     .000362   .000330    .151    11.058    .00643    .01440    .02880
50-54     .000508   .000371    .144     6.708    .00945    .01805    .03610
55-59     .000305   .000436    .273     6.657    .01421    .02275    .04550
60-64     .000391   .000459    .267     4.666    .02279    .03465    .06930
65-69     .000706   .000480    .229     5.976    .03438    .04505    .09010
70-74     .001462   .000461    .199     8.066    .05561    .06900    .13800
75-79     .001916   .000644    .239    18.208    .08854    .10850    .21700
80-84     .002237   .000904    .240    21.767    .13557    .16103    .32205
85-94                           Not computed

* Death rate (sane) taken from life table, white females, 1939-41.
† See note ‡ to Table IV.
‡ See note † to Table IV.

4. Apart from these data biases the Model itself has a structural 
bias. This bias is the result of holding death rates constant when in fact 
they are increasing. For example, in our calculations a man who has his 
first onset at age 50 and is first admitted at age 60 is treated as though 
he were a member of a cohort which over this ten year period was sub- 
ject to death rates appropriate to ages 50-54 rather than to death rates 
which increase each year. This again tends to produce under-estimates 
of $\lambda_{01}$ and $\lambda_{02}$.
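
The direction of the structural bias in item 4 can be illustrated with a small calculation. The sketch below compares survival over a ten-year lag under a death rate held at its age 50-54 level with survival when the rate steps up after five years. The two rates are the assumed mild-case values for females at 50-54 and 55-59 from Table V (the male 55-59 row was not computed); the function name and the use of a single five-year step are choices made here for illustration only.

```python
import math


def survival(segments):
    """Probability of surviving a sequence of (annual death rate, years)."""
    return math.exp(-sum(rate * years for rate, years in segments))


# Assumed death rates of mild cases outside hospital, females (Table V).
rate_50_54 = 0.01805
rate_55_59 = 0.02275

# Death rate frozen at the age-of-onset level for the whole ten years.
constant = survival([(rate_50_54, 10)])

# Death rate allowed to rise after the first five years.
stepped = survival([(rate_50_54, 5), (rate_55_59, 5)])

lost_constant = 1 - constant  # share lost to death, constant-rate model
lost_stepped = 1 - stepped    # share lost to death, rising-rate model
```

Holding the rate constant leaves a larger share of the cohort alive to be admitted (about 83.5 per cent against about 81.5 per cent here), so fewer cases are counted as lost before admission and the corrected onset rates come out too small, as the text states.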


C. Results of Computation with Model I


In Tables IV and V are given the results of our calculations using
the Ohio data, 1947-1948, and Model I. The assumed death rates $\lambda_{04}$,
$\lambda_{14}$, and $\lambda_{24}$ are given as well as the estimated values of $\lambda_{01}$, $\lambda_{02}$, $\lambda_{13}$, and
$\lambda_{23}$. A few cases were not computed. This was because for those age
groups the data had a character which made it inappropriate to the 
model and the technique of computation we were using. Thus among 
females, ages 15-19, there are 11 cases with 3-5 years elapsed times 
between onset and admission and 17 cases with 5-7 years elapsed times 
between onset and admission, whereas our method of calculation re- 
quires that the number of cases be a strictly decreasing function of 
elapsed time. If this is not so inadmissible estimates of some of the 
parameters are obtained; i.e., negative transition rates between states 
are obtained. In other models this may not be required and with other 
computational methods certainly will not be required. For example, if 
we had used Neyman’s minimum $\chi^2$ methods with five or six elapsed
time groups rather than the method of moments technique with four 
elapsed time groups this exclusion would not have been necessary. 
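
The admissibility condition just described can be checked mechanically. The sketch below applies it to a few rows of Tables I and II, taking the four elapsed time groups to be the first four lag classes (0-1, 1-3, 3-5 and 5-7 years); that reading, and the function names, are assumptions of this illustration rather than something the text spells out.

```python
def strictly_decreasing(counts):
    """True if every count is larger than the one that follows it."""
    return all(a > b for a, b in zip(counts, counts[1:]))


def inadmissible_groups(table):
    """Age groups whose lag counts are not strictly decreasing."""
    return [group for group, counts in table.items()
            if not strictly_decreasing(counts)]


# Cases in lag classes 0-1, 1-3, 3-5, 5-7 years (selected rows only).
FEMALE = {
    "15-19": [89, 30, 11, 17],
    "20-24": [177, 47, 15, 20],
    "25-29": [249, 78, 41, 24],
    "30-34": [204, 81, 36, 22],
}
MALE = {
    "50-54": [148, 45, 16, 15],
    "55-59": [161, 52, 11, 12],
    "60-64": [162, 82, 35, 13],
}

female_excluded = inadmissible_groups(FEMALE)
male_excluded = inadmissible_groups(MALE)
```

On these rows the check flags females 15-19 and 20-24 and males 55-59, which are among the groups shown as not computed in Tables IV and V.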

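In order to judge how variable from sample to sample the estimates
of $\lambda_{01}$, $\lambda_{02}$, $\lambda_{13}$, and $\lambda_{23}$ might be, some rough estimates of their sampling
variances have been made. Rather than differentiating the expressions
for $P_{03}(t) - P_{03}(t-j)$ with respect to $\lambda_{01}, \ldots, \lambda_{23}$ we obtained an
approximation of the appropriate quantities by solving linear equation
systems derived from the data. From age group to age group the
$P_1(t), \ldots, P_4(t)$ change ($t$ designating the age group) producing
concomitant variation in the estimates $\lambda_{01}, \ldots, \lambda_{23}$. Solving the obvious
equation systems one obtains, vector by vector, a rough estimate of
the matrix

$$
\frac{\partial \lambda}{\partial P} =
\begin{pmatrix}
\dfrac{\partial \lambda_{01}}{\partial P_1} & \dfrac{\partial \lambda_{01}}{\partial P_2} & \dfrac{\partial \lambda_{01}}{\partial P_3} & \dfrac{\partial \lambda_{01}}{\partial P_4} \\[2ex]
\dfrac{\partial \lambda_{02}}{\partial P_1} & \dfrac{\partial \lambda_{02}}{\partial P_2} & \dfrac{\partial \lambda_{02}}{\partial P_3} & \dfrac{\partial \lambda_{02}}{\partial P_4} \\[2ex]
\dfrac{\partial \lambda_{13}}{\partial P_1} & \dfrac{\partial \lambda_{13}}{\partial P_2} & \dfrac{\partial \lambda_{13}}{\partial P_3} & \dfrac{\partial \lambda_{13}}{\partial P_4} \\[2ex]
\dfrac{\partial \lambda_{23}}{\partial P_1} & \dfrac{\partial \lambda_{23}}{\partial P_2} & \dfrac{\partial \lambda_{23}}{\partial P_3} & \dfrac{\partial \lambda_{23}}{\partial P_4}
\end{pmatrix}
$$

over the range of ages considered. In our case the computations
were performed using the data for males ages 30-34 to 50-54. The
covariance matrix $\Sigma_{\lambda}$ of the vector of estimates $(\lambda_{01}, \ldots, \lambda_{23})$ is then

$$
\Sigma_{\lambda} = \left(\frac{\partial \lambda}{\partial P}\right) \Sigma_{P} \left(\frac{\partial \lambda}{\partial P}\right)'
$$

where $\Sigma_{P}$ is the covariance matrix of $P_1(t)$, $P_2(t)$, $P_3(t)$, and $P_4(t)$.
Because of the cross-sectional nature of our data the $P$'s are not
independent, being very slightly negatively correlated. Therefore
considering them as independent will not give too bad an estimate. We find
the approximate standard errors to be as follows: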





                  Average value        Estimated
Parameter         (30-34 to 50-54)     standard error
$\lambda_{01}$        .000268              .000078
$\lambda_{02}$        .000361              .000045
$\lambda_{13}$        .220                 .029
$\lambda_{23}$      14.560               17.46


Because the other parameters, i.e., $\lambda_{04}$, $\lambda_{14}$, and $\lambda_{24}$, also were varying, the
estimates of the standard errors must be taken as extremely rough.
Nonetheless as far as the estimates of $\lambda_{01}$ and $\lambda_{02}$ are concerned the
evidence is that they do not have very large variances relative to their
absolute values. In addition we are interested in the estimate of $\lambda_{01}+\lambda_{02}$
rather than $\lambda_{01}$ and $\lambda_{02}$ separately. $\lambda_{01}$ and $\lambda_{02}$ have a negative covariance
so that $\lambda_{01}+\lambda_{02}$ has an estimated standard error of .000045, which
compared with the average value of $\lambda_{01}+\lambda_{02}$ of .000629 is a percentage error
of 7.2 per cent, certainly a very reasonable level of accuracy.
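
The arithmetic behind the 7.2 per cent figure, and the size of the negative covariance it implies, can be reproduced from the reported standard errors. The sketch below uses only the published figures; the covariance is not given in the text and is backed out here from the identity Var(a + b) = Var(a) + Var(b) + 2 Cov(a, b), so it should be read as an implied value, not a reported one.

```python
# Reported standard errors and average values (males, 30-34 to 50-54).
se_01 = 0.000078
se_02 = 0.000045
se_sum = 0.000045                # reported s.e. of lambda01 + lambda02
mean_sum = 0.000268 + 0.000361   # average value of lambda01 + lambda02

# Relative error of the sum, as a fraction.
relative_error = se_sum / mean_sum

# Covariance implied by Var(a + b) = Var(a) + Var(b) + 2 Cov(a, b).
implied_cov = (se_sum ** 2 - se_01 ** 2 - se_02 ** 2) / 2

# Corresponding correlation between the two estimates.
implied_corr = implied_cov / (se_01 * se_02)
```

With these inputs the relative error rounds to 7.2 per cent, and the implied correlation between the two onset-rate estimates is strongly negative (roughly -0.87), which is why the sum is estimated more precisely than the first component alone.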














In general the fit of the model to the data appears fair. As a test of
goodness of fit we have used the data for the various age groups on the
number of cases with a time lag of from 7 to 11 years between onset
and admission. The data for the males give the results in Table VI.
Since the standard error in each case is essentially equal to the square
root of the expected number of cases, the agreement between expected
and observed numbers of cases is close in most cases. The value
of $\chi^2$ for the ten age groups is 18.752. The probability of exceeding this
number with 10 degrees of freedom is approximately .04. In view of the
known tendency of the model to underestimate the tail of the lag
distribution the results are not discouraging. The fit to the female data is
similar.


TABLE VI

                   Estimated number of     Observed number of
Onset age group    cases, lag 7-11 years   cases, lag 7-11 years
15-19                      5.6                      7
20-24                      6.9                     14
25-29                      --                      --
30-34                     23.8                     20
35-39                     17.4                     16
40-44                     21.1                     19
45-49                     12.9                     17
50-54                     27.4                     15
55-59                      --                      --
60-64                      7.4                      9
65-69                     15.1                     17
70-74                      6.0                     10

Mean value                12.80                    14.40
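
The goodness-of-fit figure can be recomputed from the ten age groups of Table VI for which an estimate is given. The sketch below forms the Pearson statistic and its upper-tail probability on 10 degrees of freedom; the tail function uses the closed form that holds for an even number of degrees of freedom, a shortcut chosen here to avoid any outside library.

```python
import math

# Table VI, males, lag 7-11 years (the two uncomputed groups omitted).
ESTIMATED = [5.6, 6.9, 23.8, 17.4, 21.1, 12.9, 27.4, 7.4, 15.1, 6.0]
OBSERVED = [7, 14, 20, 16, 19, 17, 15, 9, 17, 10]


def pearson_chi_square(observed, expected):
    """Sum over cells of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_tail(x, df):
    """P(chi-square with df degrees of freedom > x), for even df only."""
    if df % 2:
        raise ValueError("closed form used here needs an even df")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))


chi2 = pearson_chi_square(OBSERVED, ESTIMATED)
p_value = chi_square_tail(chi2, 10)
```

This gives a statistic of about 18.75 and a tail probability a little above .04, in line with the values quoted in the text. Most of the total comes from two groups, 20-24 and 50-54.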


D. Estimates of True Onset Rates Derived from Model I Calculations 


The main substantive result we have hoped to produce is a set of
estimates of age specific (first) onset rates. Even given the basic results
of the calculations there are several alternative estimates one might
produce. From among these we have chosen as giving the best estimate
the expression

$$
\left[\frac{\mathrm{EPO}}{\text{est. } P_{03}(\infty)}\right]
\left[\frac{\sum \text{hospitalized onsets of age } x}{\text{Total population of age } x}\right]
$$

evaluated for each age group. By EPO we mean the estimated probability
of having an onset at say age $x$ (during the year in which the
person is of age $x$) obtained by substituting in the appropriate expression
for Model I

$$
\frac{(\lambda_{01} + \lambda_{02})\left(1 - e^{-\lambda_{01}-\lambda_{02}-\lambda_{04}}\right)}{\lambda_{01} + \lambda_{02} + \lambda_{04}}
$$

the estimated and assumed values of the parameters. By est. $P_{03}(\infty)$
we mean, similarly, the estimated probability of ever being admitted
to a mental hospital based on the Model I results. The ratio of these
two estimates represents a correction factor with which to multiply the
crude estimate of the onset probability at age $x$. The expression for this
correction factor is

$$
\frac{(\lambda_{01} + \lambda_{02})(\lambda_{13} + \lambda_{14})(\lambda_{23} + \lambda_{24})}{\lambda_{01}\lambda_{13}(\lambda_{23} + \lambda_{24}) + \lambda_{02}\lambda_{23}(\lambda_{13} + \lambda_{14})}
$$

In Table VII are given the estimated values of this correction factor for
the various age groups, by sex. For the ages below 65 not much of a
correction is involved but above this age it is substantial. This reflects
principally the rapid increase in the death rates above this age since in
general the average elapsed time between onset and admission tends to
decline very slightly as age of onset increases.
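
The correction factor of this section (the ratio of the onset probability to the probability of ever being admitted) can be evaluated directly from the tabulated parameters. The sketch below does so for three male age groups of Table IV. It assumes that the sixth and seventh numeric columns of that table are the death rates of mild cases (state S1) and severe cases (state S2) respectively; the argument names are illustrative.

```python
def correction_factor(l01, l02, l13, l23, l14, l24):
    """Ratio of onset probability to probability of ever being admitted."""
    numerator = (l01 + l02) * (l13 + l14) * (l23 + l24)
    denominator = l01 * l13 * (l23 + l24) + l02 * l23 * (l13 + l14)
    return numerator / denominator


# Male parameters from Table IV (column assignment assumed as above).
MALE = {
    "15-19": dict(l01=0.000106, l02=0.000114, l13=0.260, l23=13.928,
                  l14=0.00890, l24=0.01780),
    "50-54": dict(l01=0.000313, l02=0.000330, l13=0.213, l23=29.437,
                  l14=0.03075, l24=0.06150),
    "60-64": dict(l01=0.000676, l02=0.000485, l13=0.394, l23=33.008,
                  l14=0.05500, l24=0.11000),
}

factors = {group: round(correction_factor(**params), 2)
           for group, params in MALE.items()}
```

For these three groups the computed factors agree with the male column of Table VII (1.02, 1.07 and 1.08). The factor exceeds one only to the extent that cases die before admission, so it grows with the death rates and with the length of the lag.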

TABLE VII

ESTIMATED RATIOS OF ONSET PROBABILITY TO ADMISSION
PROBABILITY OF PERSONS HAVING ONSETS AT A GIVEN
AGE, MODEL I, OHIO 1947-1948

             Ratio of onset probability to admission probability
Age group
                 Male         Female
15-19            1.02           --
20-24 }                         --
      }          1.02
25-29 }                        1.02
30-34            1.02          1.05
35-39            1.02          1.02
40-44            1.03          1.05
45-49            1.04          1.05
50-54            1.07          1.07
55-59             --           1.04
60-64            1.08          1.07
65-69            1.20          1.12
70-74            1.27          1.25
75-79            1.40          1.31
80-84            1.29          1.41


Since readers will be more familiar with age specific first admission
rates, we will derive our final correction factors so that they may be
applied directly to admission rates. In Tables VIII and IX are given
the ratios, by age group and sex, of the number of first admissions in a
given age group to the number of first onsets in the same age group.
Since Tables VIII and IX include only hospitalized patients the number
of onsets is too small. By applying the correction factors in Table
VII we can secure estimates of the total number of onsets in the population
at large. The correction factors to be applied to age specific first
admission rates in order to estimate total age specific onset rates can be
obtained by dividing the entries in Table VII by the appropriate entry
from Tables VIII and IX. The final results of this operation are given in
Table X.
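
The division described in this paragraph is simple enough to verify for a few rows. The sketch below divides male entries of Table VII by the corresponding admissions-to-onsets ratios of Table VIII and compares the quotients with the male column of Table X. Only a selection of age groups is included, and the small tolerance allows for the rounding of the published two-decimal inputs.

```python
# Selected male entries: Table VII (onset/admission probability ratio),
# Table VIII (admissions/onsets ratio) and Table X (total correction).
TABLE_VII_MALE = {"15-19": 1.02, "35-39": 1.02, "40-44": 1.03,
                  "45-49": 1.04, "50-54": 1.07, "65-69": 1.20,
                  "70-74": 1.27}
TABLE_VIII_RATIO = {"15-19": 0.60, "35-39": 1.05, "40-44": 0.75,
                    "45-49": 1.07, "50-54": 1.04, "65-69": 1.00,
                    "70-74": 1.02}
TABLE_X_MALE = {"15-19": 1.700, "35-39": 0.971, "40-44": 1.373,
                "45-49": 0.972, "50-54": 1.029, "65-69": 1.200,
                "70-74": 1.245}


def total_correction(onset_to_admission, admissions_to_onsets):
    """Factor converting a first admission rate into a total onset rate."""
    return onset_to_admission / admissions_to_onsets


computed = {group: total_correction(TABLE_VII_MALE[group],
                                    TABLE_VIII_RATIO[group])
            for group in TABLE_VII_MALE}

largest_gap = max(abs(computed[g] - TABLE_X_MALE[g]) for g in computed)
```

Dividing by the Table VIII ratio reallocates admissions to their ages of onset, and multiplying by the Table VII factor adds the cases that never reach a hospital, so the quotient converts an age specific first admission rate into an age specific total onset rate.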

From Table X it appears that the very largest corrections are for the
ages below 30. In the age group 15-19 hospital first admission rates
under-estimate the total onset rate by about 70 per cent and for the age
group 25-29 by about 21 or 22 per cent. For the ages from 30 or 35
to 65 the average correction is small (6-7 per cent for males and 1 per
cent for females). In this region first admission rates are evidently good
estimates of the onset rate, assuming that the biases discussed earlier
have not too great an effect. For the ages over 65 the picture is mixed.
For a period of 10 years (65-75) in the case of females and 15 years
(65-80) in the case of males the correction indicated is a 20 per cent
increase. After this period the correction becomes small as far as our
estimates indicate, but they are very unstable for these ages.


TABLE VIII

RATIO OF FIRST ADMISSIONS IN EACH AGE GROUP TO THE
ONSETS IN EACH AGE GROUP, OHIO MENTAL PATIENTS
(FIRST ADMISSIONS, WITH PSYCHOSIS),
MALE, 1947-1948

                                                  Ratio =
Age groups   Number of onsets   Number of         Admissions
                                admissions        / Onsets
15-19              160               96              .60
20-24              231              164              .71
25-29              250              211              .84
30-34              307              276              .91
35-39              305              320             1.05
40-44              335              253              .75
45-49              292              316             1.07
50-54              251              262             1.04
55-59              257              283             1.10
60-64              309              298              .97
65-69              318              319             1.00
70-74              261              266             1.02
75-79              224              260             1.16
80-84              128              170             1.33
85-                 51               82             1.61
Total             3678             3678              --

For the very young age groups the large correction factor (to adjust
admission rates as estimates of onset rates) is entirely the result of the
shift to age of onset as the accounting variable. Admission rates are
rising sharply between the ages of 15 and 30 and the shift to age of
onset means a shift of many more cases into these age groups than out
of them since the lag between onset and admission does not vary much
with age. Death rates are very low and few cases are lost between onset
and admission. For the middle age group, 30 to 65, no correction is
found because admission rates are relatively stable over this period and
the reallocation of cases on the basis of age of onset merely substitutes one
case for another on the average. Death rates still do not have much
effect. For the older age group, on the other hand, two factors produce
the rather substantial correction factors: admission rates are again
rapidly increasing and death rates are high with consequent high
attrition between onset and admission.


TABLE IX

RATIO OF FIRST ADMISSIONS IN EACH AGE GROUP TO THE
ONSETS IN EACH AGE GROUP, OHIO MENTAL PATIENTS
(FIRST ADMISSIONS, WITH PSYCHOSIS),
FEMALE, 1947-1948

                                                  Ratio =
Age groups   Number of onsets   Number of         Admissions
                                admissions        / Onsets
15-19              181              108              .60
20-24              303              257              .85
25-29              444              367              .83
30-34              403              397              .99
35-39              340              361             1.06
40-44              324              312              .97
45-49              297              315             1.06
50-54              293              293             1.00
55-59              247              292             1.18
60-64              228              240             1.05
65-69              231              228              .99
70-74              227              233             1.03
75-79              185              232             1.25
80-84              107              142             1.33
85-                 43               80             1.86
Total             3857             3857              --


E. Some Comments on the Consequences of the Calculations for Prevalency 
Estimates Derived from Model I 


A matter of some interest, both as an independent check on the results
of the computations and adequacy of the model and as an additional
result of substantive interest, is the prediction, from our calculations,
of one type of prevalency rate for the various age groups. By
age-specific prevalency rates we mean the total number insane on any given
date in the population within given age groups. Mental hospital reports
provide, of course, exact data on the number of hospitalized insane in
the various age groups. Our problem is to supplement these numbers
with estimates of the number of insane not hospitalized. The insane
outside of hospitals may be grouped into two classes: (1) Cases not
previously hospitalized, (2) Cases previously hospitalized and now
again insane. In our models the not previously hospitalized group is
represented by the members of the various unhospitalized insane states;
e.g., in Model I members of states $S_1$ and $S_2$, in Model II members of
state $S_2$. None of our models has provision for discharged cases. The
actual calculation for Model I of $P_{01}(t) + P_{02}(t)$, the probability of being
in either state $S_1$ or $S_2$ at time (age) $t$, would be quite tedious except for
say the ages 15-19; even then one would have to assume that the onset
rate was zero up until age 15. For other ages to do the calculations
correctly it would be necessary to piece together the contributions from
all of the processes up to the age required. Thus, if one wished to
estimate the prevalency rate for cases not previously hospitalized at age
40, one would have to compute the contribution of the process during
the ages 35-39 with its set of parameters, compute the contribution of
the process during the ages 30-34 with its set of parameters, etc. Even
this would only be an approximation due to the assumption of constant
parameter values over time which has been mentioned earlier. While
all this is true, by the use of ruthless approximations something can be
said about the number of cases one would expect to find in a prevalency
survey that have not previously been treated in hospitals for the
insane. This does not solve the problem of computing total prevalency
rates but does offer a check of our onset calculations and is a result of
some direct interest. Oedegaard has made some similar calculations
with a simple model for Norwegian data [8].


TABLE X

TOTAL CORRECTION FACTORS FOR AGE-SPECIFIC FIRST
ADMISSION (WITH PSYCHOSIS) RATES, OHIO, 1947-1948,
TO OBTAIN AGE-SPECIFIC ONSET RATES

             Total correction factors for first admission
Age groups
                 Male          Female
15-19            1.700         1.700*
20-24            1.436         1.200*
25-29            1.215         1.229
30-34            1.122         1.061
35-39             .971          .962
40-44            1.373         1.082
45-49             .972          .991
50-54            1.029         1.070
55-59             .881†         .881
60-64            1.114         1.019
65-69            1.200         1.131
70-74            1.245         1.214
75-79            1.207         1.048
80-84             .970         1.060

* These female age groups are assumed to have the same entries in Table VII as the corresponding
male age groups.

† This male age group is assumed to have the same entry in Table VII as the corresponding female
group.

If we collapse Model I into Model I’ and average the onset and lag
parameters in the obvious way, we get the following approximations.
First, notice that the admission rate at any time $t$ is

$$
\frac{dS_3(t)}{dt} = \frac{dP_{03}(t)}{dt} = \bar{\lambda}_{13}(t)\,S_1(t);
$$

thus

$$
S_1(t) \approx \frac{\text{First Admission Rate at } t}{\bar{\lambda}_{13}(t)}
$$

where

$$
\bar{\lambda}_{13}(t) = \left[\frac{\dfrac{\lambda_{01}}{\lambda_{13}} + \dfrac{\lambda_{02}}{\lambda_{23}}}{\lambda_{01} + \lambda_{02}}\right]^{-1}.
$$

This will make the average elapsed time between onset and admission
the same in the two processes: Model I and Model I’. In Table XI are
given the values of $\bar{\lambda}_{13}$ for the various age groups, by sex. Mean elapsed
time between onset and admission, $\bar{\lambda}_{13}$, is equal to the ratio of the age
specific first admission rate to the proportion of the population of given
age, insane but not previously hospitalized. The values obtained,
shown in Table XI, check rather well with some earlier calculations of
ours based on World War II Selective Service data for the male age
group 18-34 where we estimated this ratio to be approximately 2.23.
Oedegaard’s calculations lead to an average value of $\bar{\lambda}_{13}$ on the order
of 3.
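
The averaging of the two lag parameters can be illustrated numerically. The sketch below computes the onset-weighted mean lag between onset and admission implied by the Model I estimates for two male age groups of Table IV, together with the pooled admission rate that gives the same mean lag in a one-state model. Ignoring deaths before admission, and the function names, are assumptions of this sketch; the first admission rate used in the last lines is a made-up figure for illustration only.

```python
def mean_lag_years(l01, l02, l13, l23):
    """Onset-weighted mean of 1/l13 and 1/l23 (deaths ignored)."""
    return (l01 / l13 + l02 / l23) / (l01 + l02)


def pooled_admission_rate(l01, l02, l13, l23):
    """Single admission rate giving the same mean lag as the two states."""
    return 1.0 / mean_lag_years(l01, l02, l13, l23)


def unhospitalized_proportion(first_admission_rate, pooled_rate):
    """Approximate share of the population insane but never admitted."""
    return first_admission_rate / pooled_rate


# Male estimates from Table IV.
lag_15_19 = mean_lag_years(0.000106, 0.000114, 0.260, 13.928)
lag_30_34 = mean_lag_years(0.000315, 0.000283, 0.209, 12.756)

# Hypothetical first admission rate of 3 per 10,000 per year.
rate_30_34 = pooled_admission_rate(0.000315, 0.000283, 0.209, 12.756)
share_30_34 = unhospitalized_proportion(0.0003, rate_30_34)
```

The two mean lags come out at about 1.89 and 2.56 years, close to the male entries 1.88 and 2.57 of Table XI, which is at least consistent with reading the tabulated figures as mean lags in years.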

In order to obtain what are ordinarily thought of as age-specific 
prevalency rates it would be necessary to compute from other types of 
data the remaining components: The proportion of persons in some age 
group that are residents of mental hospitals and the proportion of per- 
sons, insane, not in hospitals, and with a record of previous hospitaliza- 
tion. Our models are not of much help in obtaining estimates of either 
of these two components. An estimate of the first component is readily 
available from hospital records, but the second component would have 
to be estimated from discharge and readmission data, or perhaps from 
some prevalency survey of a follow-up, case study type. While our 
particular models do not contribute to the determination of the re- 
quired estimates, models of the same type might prove very useful in 
the analysis of the discharge-readmission process. 


TABLE XI

PREVALENCY IMPLICATIONS OF MODEL I’ WITH USE OF
ESTIMATED PARAMETERS, OHIO, 1947-1948, PERSONS
WITH PSYCHOSIS

                                      Age specific first admission rate
             Values of $\bar{\lambda}_{13}$ = ----------------------------------------------
                                      Proportion of population of given age,
Age group                             insane but not hospitalized by specified age

                 Male          Female
15-19            1.88            --
20-24 }                          --
      }          2.78
25-29 }                         2.47
30-34            2.57           1.92
35-39            2.09           2.00
40-44            1.46           3.99
45-49            2.15           4.15
50-54            2.32           4.05
55-59             --            1.59
60-64            1.49           1.86
65-69            1.97           2.64
70-74            2.15           3.84
75-79            2.69           3.15
80-84            1.65           2.97
Average $\bar{\lambda}_{13}$     2.10           2.89


APPENDIX

Analysis of Model I


In this Appendix we work out in more detail the probability expressions
required for the application of one of the three models to admission
data.

In order to achieve our ultimate goal in the analysis of onset and
admission data we must evaluate the probability of an individual being
in $S_3$ at the end of a time period consisting of $n$ basic time intervals,
given that the individual was in $S_0$ at the beginning of the period. Thus
we will assume that for the particular age group under consideration all
individuals are in state $S_0$ at the beginning of a certain time period.
After $n$ days (at the end of time period $t_n$), and taking one day as the
basic time interval, we would like to know how many persons in this
age group will be in mental hospitals. This is equal to the number of
persons in the cohort multiplied by the probability of an individual of
the cohort going from $S_0$ at the beginning of $t_n$ to $S_3$ by the end of $t_n$. It
is the great virtue of the matrix formulation that it leads quite easily
to the evaluation of such probabilities. Let us define $p_{ij}^{(n)}$ as being the
probability of being in state $j$ after $n$ steps if one began in state $i$. The
$p_{ij}^{(n)}$ are the $i, j$th elements of the $n$th power of the matrix so that in
Model I

$$
P^{n} =
\begin{pmatrix}
p_{00}^{(n)} & p_{01}^{(n)} & p_{02}^{(n)} & p_{03}^{(n)} & p_{04}^{(n)} \\
0 & p_{11}^{(n)} & 0 & p_{13}^{(n)} & p_{14}^{(n)} \\
0 & 0 & p_{22}^{(n)} & p_{23}^{(n)} & p_{24}^{(n)} \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
$$

Special techniques are available for the evaluation of the $n$th powers of
matrices, especially the elements of $P^{n}$ associated with absorbing states
in rather simple matrices [1].

As indicated in the body of the text the process is broken into two
parts characterized by the matrices

$$
P =
\begin{pmatrix}
p_{00} & p_{01} & p_{02} & 0 & p_{04} \\
0 & p_{11} & 0 & p_{13} & p_{14} \\
0 & 0 & p_{22} & p_{23} & p_{24} \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
$$

and

$$
P_{\infty} =
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & p_{11} & 0 & p_{13} & p_{14} \\
0 & 0 & p_{22} & p_{23} & p_{24} \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
$$

We are to evaluate $P^{365}$, $P^{365}P_{\infty}^{365}$, $\ldots$, $P^{365}P_{\infty}^{365k}$, $\ldots$. Let us for the
present replace 365 with a generalized $N$. As a first step we obtain
$p_{01}^{(n)}$, $p_{02}^{(n)}$, and $p_{03}^{(N)}$, the latter expression first. Define $\lambda_n$ as the
probability of being in $S_3$ for the first time at the $n$th step, $n \le N$. Then

$$
\lambda_n = p_{01}^{(n-1)} p_{13} + p_{02}^{(n-1)} p_{23},
$$

or, verbally, $\lambda_n$ equals the probability of being in $S_1$ at the $(n-1)$th step
and going from $S_1$ to $S_3$ at the $n$th step or the probability of being in $S_2$
at the $(n-1)$th step and going from $S_2$ to $S_3$ at the $n$th step. Clearly the
probability of being in $S_3$ at the $N$th step is equal to $\sum_{n=1}^{N} \lambda_n$ since $S_3$ is
an absorbing state. We find that


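$$
p_{01}^{(n-1)} = \frac{\left(p_{11}^{\,n-1} - p_{00}^{\,n-1}\right) p_{01}}{p_{11} - p_{00}},
\qquad
p_{02}^{(n-1)} = \frac{\left(p_{22}^{\,n-1} - p_{00}^{\,n-1}\right) p_{02}}{p_{22} - p_{00}}
$$

and, thus

$$
\lambda_n = \frac{\left(p_{11}^{\,n-1} - p_{00}^{\,n-1}\right) p_{01} p_{13}}{p_{11} - p_{00}}
+ \frac{\left(p_{22}^{\,n-1} - p_{00}^{\,n-1}\right) p_{02} p_{23}}{p_{22} - p_{00}}.
$$

$p_{03}^{(N)} = \sum_{n=1}^{N} \lambda_n$ and since these sums involve only geometric series, we
have

$$
\begin{aligned}
p_{03}^{(N)} ={}& \frac{p_{01} p_{13}}{p_{11} - p_{00}} \left[\frac{1 - p_{11}^{N}}{1 - p_{11}}\right]
+ \frac{p_{02} p_{23}}{p_{22} - p_{00}} \left[\frac{1 - p_{22}^{N}}{1 - p_{22}}\right] \\
&- \left[\frac{p_{01} p_{13}}{p_{11} - p_{00}} + \frac{p_{02} p_{23}}{p_{22} - p_{00}}\right]
\left[\frac{1 - p_{00}^{N}}{1 - p_{00}}\right].
\end{aligned}
\tag{1}
$$
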
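It is convenient to replace this expression for the discrete process by its
analog for a continuous process. In order to do this we shrink the basic
time intervals (of one day) down toward zero, keeping constant, at say
one year, the total length of the time period during which the process
has been going on. In order to do this we let $\lambda_{ij} = N p_{ij}$ for $i \ne j$. In the
limiting procedure when $N \to \infty$, $p_{ij}$ will approach zero since the $\lambda_{ij}$
remain fixed. Our limiting procedure implies the additional restrictions

$$
\begin{aligned}
p_{00} &= 1 - p_{01} - p_{02} - p_{04} \\
p_{11} &= 1 - p_{13} - p_{14} \\
p_{22} &= 1 - p_{23} - p_{24}
\end{aligned}
$$

and, in the limit, the probability of remaining in a given state
approaches one. After substituting $\lambda_{ij}/N$ for $p_{ij}$ in (1), and taking the
limit⁵ we have

$$
\begin{aligned}
P_{03}(t = 1) ={}& \frac{\lambda_{01}\lambda_{13}\left[1 - e^{-\lambda_{13}-\lambda_{14}}\right]}{(\lambda_{01} + \lambda_{02} + \lambda_{04} - \lambda_{13} - \lambda_{14})(\lambda_{13} + \lambda_{14})} \\
&+ \frac{\lambda_{02}\lambda_{23}\left[1 - e^{-\lambda_{23}-\lambda_{24}}\right]}{(\lambda_{01} + \lambda_{02} + \lambda_{04} - \lambda_{23} - \lambda_{24})(\lambda_{23} + \lambda_{24})} \\
&- \frac{\lambda_{01}\lambda_{13}\left[1 - e^{-\lambda_{01}-\lambda_{02}-\lambda_{04}}\right]}{(\lambda_{01} + \lambda_{02} + \lambda_{04} - \lambda_{13} - \lambda_{14})(\lambda_{01} + \lambda_{02} + \lambda_{04})} \\
&- \frac{\lambda_{02}\lambda_{23}\left[1 - e^{-\lambda_{01}-\lambda_{02}-\lambda_{04}}\right]}{(\lambda_{01} + \lambda_{02} + \lambda_{04} - \lambda_{23} - \lambda_{24})(\lambda_{01} + \lambda_{02} + \lambda_{04})}
\end{aligned}
\tag{2}
$$

This is the probability of being in $S_3$ at the end of the first year for a
person in $S_0$ at the beginning of that year. If we were interested in
dividing this one year period into more than one smaller period, a “t”
should appear in the exponent of all the exponential terms.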
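
The passage from the discrete process to its continuous analog can be checked numerically. The sketch below builds the one-day transition matrix of Model I with $p_{ij} = \lambda_{ij}/365$, raises it to the 365th power, and compares the resulting probability of having moved from $S_0$ to $S_3$ with the continuous expression (2). The rates are the male 30-34 values of Table IV, with the sixth and seventh numeric columns taken as the death rates of mild and severe cases (an assumption); all function names are illustrative.

```python
import math

# Yearly rates, males with onset at 30-34 (Table IV, columns as assumed).
RATES = {"01": 0.000315, "02": 0.000283, "13": 0.209, "23": 12.756,
         "04": 0.00321, "14": 0.00833, "24": 0.01665}


def one_step_matrix(rates, steps):
    """Matrix P for one basic interval, with p_ij = lambda_ij / steps."""
    p = {key: value / steps for key, value in rates.items()}
    return [
        [1 - p["01"] - p["02"] - p["04"], p["01"], p["02"], 0.0, p["04"]],
        [0.0, 1 - p["13"] - p["14"], 0.0, p["13"], p["14"]],
        [0.0, 0.0, 1 - p["23"] - p["24"], p["23"], p["24"]],
        [0.0, 0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 1.0],
    ]


def mat_mul(a, b):
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size))
             for j in range(size)] for i in range(size)]


def mat_pow(matrix, power):
    """matrix raised to a non-negative integer power, by squaring."""
    size = len(matrix)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    while power:
        if power & 1:
            result = mat_mul(result, matrix)
        matrix = mat_mul(matrix, matrix)
        power >>= 1
    return result


def p03_first_year(r):
    """Continuous-time expression (2) for P03(t = 1)."""
    out0 = r["01"] + r["02"] + r["04"]      # total rate of leaving S0
    total = 0.0
    for onset, adm, death in (("01", "13", "14"), ("02", "23", "24")):
        out = r[adm] + r[death]             # total rate of leaving S1 or S2
        coeff = r[onset] * r[adm] / (out0 - out)
        total += coeff * ((1 - math.exp(-out)) / out
                          - (1 - math.exp(-out0)) / out0)
    return total


discrete = mat_pow(one_step_matrix(RATES, 365), 365)[0][3]
continuous = p03_first_year(RATES)
```

With daily steps the two figures differ by well under one per cent, both being close to 0.00029, so little is lost by working with the continuous form.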

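We must now turn our attention to those cases who while having
their onset at age $x$ do not enter the hospital (or $S_3$) until ages $x+1$,
$x+2$, etc. As before, we define $\lambda_n$ to be the probability of being in $S_3$ for
the first time at the $n$th step, this time $n \ge N$. Thus,

$$
\lambda_n = p_{01}^{(N)}\, p_{11}^{(n-N-1)}\, p_{13} + p_{02}^{(N)}\, p_{22}^{(n-N-1)}\, p_{23}.
$$

We see that

$$
p_{11}^{(n-N-1)} = p_{11}^{\,n-N-1},
\qquad
p_{22}^{(n-N-1)} = p_{22}^{\,n-N-1}.
$$

⁵ $\lim_{N \to \infty} \left(1 - \dfrac{x}{N}\right)^{N} = e^{-x}$.

Substituting and summing,

$$
\sum_{n=N+1}^{N+n'} \lambda_n = p_{01}^{(N)} \frac{p_{13}\left(1 - p_{11}^{\,n'}\right)}{1 - p_{11}}
+ p_{02}^{(N)} \frac{p_{23}\left(1 - p_{22}^{\,n'}\right)}{1 - p_{22}};
$$

thus,

$$
p_{01}^{(N)} = \frac{\left(p_{11}^{N} - p_{00}^{N}\right) p_{01}}{p_{11} - p_{00}},
\qquad
p_{02}^{(N)} = \frac{\left(p_{22}^{N} - p_{00}^{N}\right) p_{02}}{p_{22} - p_{00}}.
$$

Substituting and taking limits as before, we have

$$
\begin{aligned}
P_{03}(t \ge 1 \mid S_1 \text{ or } S_2 \text{ at } t = 1)
={}& \frac{\lambda_{01}\lambda_{13}\left[1 - e^{-(\lambda_{13}+\lambda_{14})(t-1)}\right]\left[e^{-\lambda_{13}-\lambda_{14}} - e^{-\lambda_{01}-\lambda_{02}-\lambda_{04}}\right]}{(\lambda_{01} + \lambda_{02} + \lambda_{04} - \lambda_{13} - \lambda_{14})(\lambda_{13} + \lambda_{14})} \\
&+ \frac{\lambda_{02}\lambda_{23}\left[1 - e^{-(\lambda_{23}+\lambda_{24})(t-1)}\right]\left[e^{-\lambda_{23}-\lambda_{24}} - e^{-\lambda_{01}-\lambda_{02}-\lambda_{04}}\right]}{(\lambda_{01} + \lambda_{02} + \lambda_{04} - \lambda_{23} - \lambda_{24})(\lambda_{23} + \lambda_{24})}
\end{aligned}
\tag{3}
$$

where $t \ge 1$. This represents only part of the probability of being in $S_3$
and the total cumulative probability of being in $S_3$ at $t \ge 1$ is equal to

$$
P_{03}(t = 1) + P_{03}(t \ge 1 \mid S_1 \text{ or } S_2 \text{ at } t = 1).
$$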








Model I’ 


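A simplified model of the onset-admission process can be obtained
from Model I by removing one of the insanity states. Let us remove S_2;
then the process is characterized by the matrix of transition probabilities

$$
\begin{pmatrix}
p_{00} & p_{01} & 0 & p_{04} \\
0 & p_{11} & p_{13} & p_{14} \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}.
$$

Thus, since the rates involving S_2 all vanish (λ_02 = λ_23 = λ_24 = 0),

$$
\begin{aligned}
P_{03}(t = 1) ={}& \frac{\lambda_{01}\lambda_{13}\,[1 - e^{-\lambda_{13}-\lambda_{14}}]}{(\lambda_{01}+\lambda_{04}-\lambda_{13}-\lambda_{14})(\lambda_{13}+\lambda_{14})} \\
&- \frac{\lambda_{01}\lambda_{13}\,[1 - e^{-\lambda_{01}-\lambda_{04}}]}{(\lambda_{01}+\lambda_{04}-\lambda_{13}-\lambda_{14})(\lambda_{01}+\lambda_{04})}
\end{aligned}
\tag{4}
$$

and

$$
P_{03}(t \ge 1 \mid S_1 \text{ at } t = 1) = \frac{\lambda_{01}\lambda_{13}\,[1 - e^{-(\lambda_{13}+\lambda_{14})(t-1)}]\,[e^{-\lambda_{13}-\lambda_{14}} - e^{-\lambda_{01}-\lambda_{04}}]}{(\lambda_{01}+\lambda_{04}-\lambda_{13}-\lambda_{14})(\lambda_{13}+\lambda_{14})}.
$$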


Similar results have been obtained for Models II and III, but since
no computations have been done using them it was not thought
worthwhile to present them here. The expressions for P_03(t), etc. are
substantially more complicated, especially in the case of Model III.
MAXIMUM LIKELIHOOD AND MINIMUM χ² ESTIMATES
OF THE LOGISTIC FUNCTION


JOSEPH BERKSON 
Mayo Clinic and Mayo Foundation* 


IN MANY situations commonly encountered in statistical practice, the
minimum χ²¹ and the maximum likelihood estimates are identical.
An example is the estimation of P = 1 − Q, the proportion of a population
possessing some specified characteristic; this may be presented in
a simple way:

The Pearson χ² is given by

$$\chi^2 = \sum \frac{(o - e)^2}{e}, \tag{1}$$

where o is the observed frequency and e is the expected frequency.
In a random sample of N from the population, the expected frequency
of individuals having the characteristic is PN, and the expected
frequency of individuals not having it is QN. If we observe r "haves"
and s = N − r "haven'ts," and define an observed p = r/N, and an
observed q = 1 − p = s/N, the χ² table is

                Have     Haven't     Total
   Observed     pN       qN          N
   Expected     PN       QN          N

$$\chi^2 = \frac{(pN - PN)^2}{PN} + \frac{(qN - QN)^2}{QN} = \frac{N}{PQ}\,(p - P)^2. \tag{2}$$

If we estimate P on the principle that we shall take that value which
minimizes the χ² of (2) and designate the estimate as p̂ we obtain

$$\hat{p} = p = r/N. \tag{3}$$

The estimate (3) is well known to be also the maximum likelihood
estimate of P, if p is assumed to be binomially distributed around the
true P.²


As in the case with the estimate of the binomial parameter, the minimum
χ² estimate of the Poisson parameter λ from a sample of the
distribution

$$P(x) = \frac{e^{-\lambda}\lambda^{x}}{x!} \tag{4}$$

can easily be shown to be x̄, the mean of the sample, and this is also
the maximum likelihood estimate.

There are, however, some situations, which also occur not infrequently
in practice, in which the minimum χ² and maximum likelihood
estimates are not identical. Even in these situations, as the number in
each sample increases indefinitely, the sampling distributions of both
estimates become more nearly the same. Both estimates are "efficient"
in the sense of Fisher [16], as well as "best asymptotically normal" in
the sense of Neyman [20]. These characteristics, however, refer to
asymptotic properties, which are in the realm of what Fisher calls
the "theory of large samples," where, to quote his words, "nothing
that we say shall be true, except in the limit when the sample is
indefinitely increased; a limit, obviously, never attained in practice [13]."

For finite samples, which is to say for any real statistical samples,
small or large, the estimates may differ in their distributions, and the
question arises, "Which is the better estimate?" Little or nothing is
reliably known which will provide an answer to this question, although
conjectural opinions in favor of the maximum likelihood estimate are
frequently expressed. For the last several years, I have been trying to
accumulate some information, intended to help to clarify the problem,
by calculations and experiments with actual samples, referring to
situations which I have encountered in practice. The present article
reports the results of one series of such experiments.

The parameters to be estimated are regression coefficients, or perhaps
it is better to say the coefficients in a functional equation, and
the equation concerned is the logistic function with binomial variation
of the dependent variate, which has been advanced [1] as a model for
bio-assay with quantal response,

$$P_i = 1 - Q_i = \frac{1}{1 + e^{-(\alpha + \beta x_i)}}, \qquad \hat{p}_i = 1 - \hat{q}_i = \frac{1}{1 + e^{-(a + b x_i)}}. \tag{5}$$

* The Mayo Foundation is a part of the Graduate School of the University of Minnesota.

1 Unless otherwise specified, the χ² referred to is the classic χ² of Pearson. Appendix Note 1 should
be read for the definition of the minimum χ² estimate.

2 It is worth noting that in deriving the χ² estimate, we needed only to be able to write the
expectation PN, but to derive the maximum likelihood estimate, we needed to assume that the observed p is
distributed binomially in random samples. It is an important methodological advantage of least-squares
types of estimates, over the maximum likelihood estimate, that the calculation of the maximum
likelihood estimate requires complete knowledge of the distribution, in order to enable one to write
the specific probability which is to be maximized, whereas the method of least squares generally
requires only some limited knowledge about some functions of the distribution, perhaps only the mean,
perhaps the mean and variance.














132 AMERICAN STATISTICAL ASSOCIATION JOURNAL, MARCH 1955


The straight line transform of this function, called the logit, is given
by

$$\operatorname{logit} P_i = \ln (P_i/Q_i) = \alpha + \beta x_i. \tag{6}$$

The equations to be solved for the maximum likelihood estimate of
α and β are

$$\sum n_i (p_i - \hat{p}_i) = 0, \tag{7}$$

$$\sum n_i x_i (p_i - \hat{p}_i) = 0, \tag{8}$$

where n_i is the number at x_i, p_i = 1 − q_i is the proportion of n_i observed
to respond, and p̂_i is the estimate of P_i.

For the minimum χ² estimate the equations to be solved are

$$\sum n_i\, \frac{p_i \hat{q}_i + q_i \hat{p}_i}{\hat{p}_i \hat{q}_i}\, (p_i - \hat{p}_i) = 0, \tag{9}$$

$$\sum n_i x_i\, \frac{p_i \hat{q}_i + q_i \hat{p}_i}{\hat{p}_i \hat{q}_i}\, (p_i - \hat{p}_i) = 0. \tag{10}$$


Since the coefficients of these equations are functions of the p̂ which
is to be estimated, as well as because p̂ is not linear in the parameters,
in general, although not always, both the minimum χ² and maximum
likelihood estimates require iterative methods for solution.³ For
situations met routinely in the laboratory, the iterative procedure is
unsatisfactory, not only because of the large amount of arithmetic labor
involved, but because, since the solution is only approached as a limit
which is never really reached, there is a problem of defining the
estimate which actually is attained [1].
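
To make the iterative character of the maximum likelihood solution concrete, here is a minimal sketch that solves (7) and (8) by Newton-Raphson steps. It is not the computing procedure of the paper (that is described in Appendix Note 2); the function name, the starting values of zero and the fixed number of iterations are assumptions made for illustration, and the sketch does not guard against samples for which no finite solution exists.

```python
import math

def ml_logistic(x, n, r, iterations=25):
    """Newton-Raphson solution of the likelihood equations (7) and (8).

    x: dose values, n: numbers exposed, r: numbers responding.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        ga = gb = iaa = iab = ibb = 0.0
        for xi, ni, ri in zip(x, n, r):
            p_hat = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = ni * p_hat * (1.0 - p_hat)
            ga += ri - ni * p_hat            # left side of (7)
            gb += xi * (ri - ni * p_hat)     # left side of (8)
            iaa += w
            iab += w * xi
            ibb += w * xi * xi
        det = iaa * ibb - iab * iab
        # Newton step: add (information matrix)^-1 times the score vector.
        a += (ibb * ga - iab * gb) / det
        b += (iaa * gb - iab * ga) / det
    return a, b

# Three doses one unit apart, 10 animals each, 3, 5 and 7 responding.
a_hat, b_hat = ml_logistic([-1, 0, 1], [10, 10, 10], [3, 5, 7])
print(a_hat, b_hat)
```

For this symmetric sample the equations are satisfied exactly by a = 0 and b = ln(7/3), so the iteration can be checked against a known answer; for most samples the solution is only approached step by step, which is the difficulty the text describes.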

3 See Appendix Note 2 for methods used in sampling and calculation.

In 1944 I [4] suggested a noniterative solution which I then designated
as "least squares," but which is presently called the "minimum
logit χ² estimate," defined by minimization of the following quantity,
called the "logit χ²":

$$\chi^2(\text{logit}) = \sum n_i p_i q_i (l_i - \hat{l}_i)^2, \tag{11}$$

where l_i = ln (p_i/q_i) represents the observed logit, and l̂_i = a + b x_i
represents the estimated value of the logit. The normal equations for
obtaining the minimum logit χ² estimate of α and β are

$$\sum n_i p_i q_i (l_i - \hat{l}_i) = 0, \tag{12}$$

$$\sum n_i p_i q_i x_i (l_i - \hat{l}_i) = 0. \tag{13}$$

The evaluation of (12) (13) leads to a procedure that amounts simply to
obtaining a least-squares solution of the straight line

$$\hat{l}_i = a + b x_i, \tag{14}$$

using n_i p_i q_i as weight of the observation l_i. Since the coefficients are
in terms of the known observations, and l̂ is linear in the parameters,
(12) (13) can be solved directly and simply.
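
The weighted least-squares reading of (12) and (13) can be sketched in a few lines. The function name is invented for illustration, and the sketch simply refuses samples in which an observed proportion is 0 or 1, since the observed logit is then not finite; the rule actually used for such samples belongs to Appendix Note 2 and is not reproduced here.

```python
import math

def min_logit_chi2(x, n, r):
    """Noniterative minimum logit chi-square estimate of (a, b).

    Weighted least squares of the observed logit on x with weights n*p*q.
    """
    sw = swx = swl = swxx = swxl = 0.0
    for xi, ni, ri in zip(x, n, r):
        if ri <= 0 or ri >= ni:
            raise ValueError("observed proportion of 0 or 1 has no finite logit")
        p = ri / ni
        w = ni * p * (1.0 - p)            # weight n_i p_i q_i
        l = math.log(p / (1.0 - p))       # observed logit l_i
        sw += w
        swx += w * xi
        swl += w * l
        swxx += w * xi * xi
        swxl += w * xi * l
    # Solve the two normal equations (12) and (13) directly.
    b = (sw * swxl - swx * swl) / (sw * swxx - swx * swx)
    a = (swl - b * swx) / sw
    return a, b

a_est, b_est = min_logit_chi2([-1, 0, 1], [10, 10, 10], [3, 5, 7])
print(a_est, b_est)
```

No iteration is involved: the weights depend only on the observed proportions, so one pass over the data gives the estimate.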

The quantity on the right hand side of (11), which is the logit χ², is
asymptotically distributed as χ², as is the classic Pearson χ², and for
small numbers is very close to the Pearson χ² [3]. Taylor [28] has
presented the proof that the minimum logit χ² estimate falls in the class
of estimates R.B.A.N. of Neyman ("regular best asymptotically
normal") and that it therefore has the same asymptotic properties
as the maximum likelihood estimate.

We may now turn to an inquiry into the properties of these estimates
for finite samples.

The criteria for judging which of several estimates is “best” are of 
course conceivably many. One criterion frequently used is the com- 
parative size of the mean square error, that is, the expected value of the 
squared difference of the estimate from the true value of the param- 
eter; this is the one used in the present investigation. 
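
Since the tables that follow report mean, variance and mean square error separately, it may help to recall that the three are tied together: the mean square error is the variance about the mean plus the square of the bias. A small sketch (the function name is an assumption made here) checks this on two entries of Table 1 for the minimum logit χ² estimate.

```python
def mean_square_error(variance, bias):
    """Mean square error as variance about the mean plus squared bias."""
    return variance + bias ** 2

# Table 1, minimum logit chi-square estimate, central P = 0.8 and 0.85:
print(round(mean_square_error(0.187, -0.097), 3))
print(round(mean_square_error(0.201, -0.141), 3))
```

Because the tabulated values are themselves rounded to three places, other entries may differ from this identity by a unit in the last place.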

The definitive part of the investigation was made for an experiment
with three values of x_i equally spaced, unit distance apart and with
n_i = 10 for each x_i. The true P's at the three successive positions of x_i
were initially taken as 0.3, 0.5, 0.7, which defined the values of the
parameters as α = 0, β = 0.84730. The program of experiments was in
two sections, one in which β was considered known and only α to be
estimated, the other in which α and β were to be estimated
simultaneously, these two situations simulating actual practical conditions in
the field of bio-assay.

With 10 animals exposed at x_i, there are 11 possibilities of number of
animals responding, and with 3 doses there are in all 1,331 possible
samples. For the case with β known, α to be estimated, the statistics
were calculated for the entire sampling distribution; that is, the
estimates were evaluated for all 1,331 possible samples, and these were
weighted by their appropriate probabilities to yield the required
statistics. For the case with both α and β to be estimated, the total sampling
distribution was evaluated in the case of the maximum likelihood and
minimum logit χ² estimates, but because of the laborious calculations
involved with the minimum Pearson χ² estimate, a stratified random
sample of 1,000 was used, for each of the dose arrangements for which
experiments were performed.⁴
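
The complete enumeration described above is easy to sketch for the simplest case, the maximum likelihood estimate of α with β known. The code below is an illustration under stated assumptions, not the paper's own procedure: the function names are invented, equation (7) is solved by bisection, the doses are coded as −1, 0, 1, and the two samples with no finite estimate (none or all responding) are dropped and the remaining probabilities renormalized.

```python
import math
from itertools import product

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def ml_alpha(total, x, n, beta):
    """Solve (7) for alpha by bisection when beta is known.

    With equal n at each dose, (7) depends on the sample only through
    the total number responding.
    """
    lo, hi = -30.0, 30.0
    for _ in range(200):
        mid = (lo + hi) / 2
        fitted = sum(n / (1 + math.exp(-(mid + beta * xi))) for xi in x)
        if fitted < total:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def sampling_moments(true_p, x, n, beta, alpha):
    """Bias, variance and mean square error over all (n+1)^k samples."""
    pmfs = [[binom_pmf(k, n, p) for k in range(n + 1)] for p in true_p]
    top = len(x) * n
    est = {t: ml_alpha(t, x, n, beta) for t in range(1, top)}
    mass = m1 = m2 = 0.0
    for rs in product(range(n + 1), repeat=len(x)):
        t = sum(rs)
        if t == 0 or t == top:            # infinite estimate: omitted
            continue
        w = math.prod(pmf[r] for pmf, r in zip(pmfs, rs))
        mass += w
        m1 += w * est[t]
        m2 += w * est[t] ** 2
    mean = m1 / mass
    var = m2 / mass - mean ** 2
    bias = mean - alpha
    return bias, var, var + bias ** 2

bias, var, mse = sampling_moments((0.3, 0.5, 0.7), (-1, 0, 1), 10, 0.84730, 0.0)
print(bias, var, mse)
```

By the symmetry of this dose arrangement the bias must vanish, which gives a simple check on the enumeration; the variance can then be compared with the maximum likelihood entry in the first row of Table 1.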


MAIN EXPERIMENTS; β KNOWN, α TO BE ESTIMATED


For the first experiment the value of P corresponding to the central
one of the 3 doses was 0.5. This experiment, disposed symmetrically
around P = 0.5, can reasonably be considered the model experiment,
since in the design of a bio-assay experiment, one centers the dosages as
nearly as possible at the E.D. 50, the sampling errors of the estimates
being smallest with this arrangement. Additional experiments were
performed with central P at 0.6, 0.7, 0.8, and 0.85. The results are
summarized in Table 1.⁵ It is seen that at central P = 0.5 all three
estimates are unbiased and that the variance is smallest for the minimum
logit χ² estimate, next larger for the minimum Pearson χ² estimate,
and largest for the maximum likelihood estimate. For all other dose
arrangements each of the estimates is biased, while the mean square
error, and also the variance about the mean, are again smallest for the
minimum logit χ² estimate, largest for the maximum likelihood
estimate, with the minimum Pearson χ² estimate falling between the two.

TABLE 1
ESTIMATE OF α, β KNOWN

  True P at dose    |         Mean           |        Variance        |   Mean square error
                    | Max.    Min.     Min.  | Max.    Min.     Min.  | Max.    Min.     Min.
  Low   Mid   High  | lik.    Pearson  logit | lik.    Pearson  logit | lik.    Pearson  logit
                    |         χ²       χ²    |         χ²       χ²    |         χ²       χ²
  .3    .5    .7    |   0       0        0   | .158    .139     .137  | .158    .139     .137
  .391  .6    .778  | +.012   −.017    −.026 | .164    .143     .141  | .164    .144     .141
  .5    .7    .845  | +.028   −.036    −.055 | .186    .161     .156  | .187    .162     .159
  .632  .8    .903  | +.056   −.059    −.097 | .246    .207     .187  | .249    .211     .196
  .708  .85   .930  | +.080   −.077    −.141 | .305    .249     .201  | .312    .255     .221

Logistic function β = 0.84730 known, α = 0 to be estimated. 3 equally spaced doses, 10 at each dose.
Comparison of statistics of the three estimators for various positions of the dosages. Based on total
sampling population. The two samples with observed p's respectively 0, 0, 0, and 100, 100, 100 per cent,
yield an infinite estimate by maximum likelihood and were omitted in calculation of all the estimates.
The fraction of the total population constituted by such samples is very small, 0.3 per cent for the
experiment with central P = 0.85 and less than that for the other experiments.


MAIN EXPERIMENTS; α AND β BOTH TO BE ESTIMATED


For comparison of the mean square error of the estimates, experiments
were performed with central P at 0.5, 0.6, 0.7, 0.8, but were not
done with central P beyond P = 0.8, because with the two parameters
to be estimated samples insoluble by maximum likelihood become too
frequent.⁶ The comparisons are shown in Table 2. The characteristics
of the findings are worth noting in a few details.

TABLE 2

  True P at dose    |         Mean           |        Variance        |   Mean square error    | Per cent samples
                    | Max.    Min.     Min.  | Max.    Min.     Min.  | Max.    Min.     Min.  | insoluble by
  Low   Mid   High  | lik.    Pearson  logit | lik.    Pearson  logit | lik.    Pearson  logit | max. lik.
                    |         χ²       χ²    |         χ²       χ²    |         χ²       χ²    |

  ESTIMATE OF α
  .3    .5    .7    |   0      .002      0   | .187    .179     .154  | .187    .179     .154  | 0.07
  .391  .6    .778  | −.006   −.013    −.020 | .230    .218     .205  | .230    .218     .206  | 0.11
  .5    .7    .845  | −.021   −.011    −.013 | .430    .412     .393  | .430    .412     .394  | 0.54
  .632  .8    .903  | −.026    .037     .084 | 1.102   .970     .682  | 1.103   .972     .689  | 3.91

  ESTIMATE OF β
  .3    .5    .7    |  .095    .062     .048 | .313    .276     .268  | .322    .280     .271  |
  .391  .6    .778  |  .100    .062     .038 | .331    .303     .270  | .341    .307     .272  |
  .5    .7    .845  |  .108    .037     .004 | .393    .322     .274  | .404    .323     .274  |
  .632  .8    .903  |  .088   −.019    −.077 | .458    .392     .202  | .466    .392     .208  |

Same as Table 1, but α and β both to be estimated. Comparison of statistics of the three estimators
for various positions of the dosages. Statistics of the maximum likelihood and minimum logit χ²
estimates based on total sampling population, those of minimum Pearson χ² on stratified random sample of
1000 at each dosage arrangement. Samples not yielding finite estimates by maximum likelihood omitted
in calculating all statistics.

For the estimate of α, with dosages disposed symmetrically around
P = 0.5, the estimates are unbiased, as they are in the case where only α
is to be estimated.⁷ For all other dose arrangements the estimates are
biased even as was seen to be the case with only α to be estimated.
However, whereas in the case of α only to be estimated, the bias of
the maximum likelihood estimate was positive for all dispositions of
the dosages, it is in the present situation negative, and increases in
absolute value as the difference of central dose from P = 0.5 increases.
For both minimum χ² estimates the bias is negative in the experiment
with central P = 0.6, but this negative bias decreases in absolute value
as the central P of the experiment is removed further from P = 0.5,
and is positive at central P = 0.8. At some point between P = 0.7 and
P = 0.8, we may surmise that the minimum χ² estimates of α are
unbiased. Since the negative bias increases for the maximum likelihood
estimate continuously with increase of α (change of central P of dosage
arrangement), there is a zone of α for which the bias of the minimum
χ² estimates is less than that of the maximum likelihood estimate, a
situation which does not obtain when α alone is to be estimated. At
central P = 0.7, the bias of all three estimates is negative and that of the
χ² estimates is smaller than that of the maximum likelihood estimates.
As respects the variance and mean square errors, for all dosage
arrangements of the experiments these are smallest for the minimum
logit χ² estimates, next larger for the minimum Pearson χ² estimates,
and largest for the maximum likelihood estimate, as was the case with
only α to be estimated.

4 See Appendix Note 2 for methods used in sampling and calculation.

5 Results are shown for central P ≥ 0.5. For experiments with central P < 0.5, the results are the
same as for the symmetrically placed experiment in which P > 0.5, with only the modification that there
is a change of sign of bias where there is bias.

6 See Appendix Note 3 for a discussion of samples insoluble by maximum likelihood.

7 The bias of 0.002 shown for the minimum Pearson χ² estimate is a sampling error.

As regards the estimate of β, at disposition of dosages central P = 0.5,
the bias situation is different from what was found for estimate of α,
either with β known or to be estimated simultaneously with β. Here
all three estimates are biased, the maximum likelihood showing the
largest bias, the minimum Pearson χ² the next smaller, and the minimum
logit χ² the smallest bias. It is to be recalled that in the case of
the one parameter α to be estimated, the mean square errors of the
minimum χ² estimates were smaller than those of the maximum
likelihood estimates in spite of a larger bias, because of the smaller variance.
Here, although the relations of the biases are different with different
disposition of dosages, the bias as well as the variance, in a zone of
dosage arrangements, is less for the χ² estimates than for the maximum
likelihood estimate. As regards variance around the mean, and mean
square error, the comparisons are uniformly favorable for the χ²
estimates at all dose arrangements. As before for α to be estimated with β
known or unknown, both the variance and mean square error of the
estimate of β are smallest for the minimum logit χ² estimate, next larger
for the minimum Pearson χ² estimate, and largest for the maximum
likelihood estimate.

A reader of the manuscript of this article suggested that it is worth
emphasizing that the biases of the estimates are not large. To this
may be added the observation that the biases of the minimum logit
χ² estimate and that of the maximum likelihood estimate, such as they
are, are practically equal (though of opposite sign) and that the minimum
logit χ² estimate nowhere achieves its smaller mean square error
from a smaller bias alone. For centrally placed dosages, which is the
arrangement to which a well-designed experiment approximates, for
estimates of α the bias is zero. Therefore, although we are not dealing
with a system of unbiased estimates, we can from a practical viewpoint
fairly disregard the question of bias and consider the comparisons
as essentially in terms of the variances.


SUPPLEMENTARY EXPERIMENTS 


The experiments which have been described are the basic ones of the
present investigation and involve calculations of the total sampling
population. Supplementary to these I did some experiments, having in
mind specific questions that arose in the course of a discussion of the
results described in previous sections of this paper. These questions
had to do mostly with whether the spread of the dosages and/or their
number might reverse the comparison of results as between the maximum
likelihood and minimum χ² estimates. Three experiments were
performed, comparing maximum likelihood and minimum logit χ²
estimates, employing in each 100 stratified samples, equally spaced
doses, 10 at each dose:⁸ (1) 3 doses, central dose corresponding to P = 0.5,
lowest dose corresponding to P = 0.1, highest dose to P = 0.9, both
parameters to be estimated, in which situation about 13 per cent of the
samples have an infinite estimate by maximum likelihood; (2) 11 doses
equally spaced, central dose corresponding to P = 0.5, lowest dose
corresponding to P = 0.01, highest dose to P = 0.99, β known, α to be
estimated; (3) 4-point parallel assay, lower dose corresponding to P = 0.3,
upper dose to P = 0.7, 10 at each dose for both unknown and standard.
The results, shown in Tables 3, 4, and 5, are entirely in keeping with
those found in the main experiments. Judged on the basis of variance
about the mean or mean square error the minimum logit χ² estimate
appears better than the maximum likelihood estimate.

TABLE 3
ESTIMATE OF α AND β: DOSAGES SPREAD

                                    a                            b
  Statistic             Max. lik.   Min. logit χ²    Max. lik.   Min. logit χ²
  Mean                    .035         .034            .087         −.181
  Variance                .287         .186            .352          .182
  Mean square error       .288         .187            .360          .215

Three equally spaced doses, with respective P = 0.1, 0.5, 0.9. 10 at each dose. α = 0, β = 2.197, both
to be estimated. Comparison of minimum logit χ² and maximum likelihood estimates. Based on 100
stratified random samples; 13 per cent of samples with infinite estimate by maximum likelihood omitted
in calculating all statistics.

TABLE 4
ESTIMATE OF α: 11 DOSES

  Statistic             Max. lik.   Min. logit χ²
  Mean                    .000        −.005
  Variance                .103         .069
  Mean square error       .103         .069

Experiment with 11 doses equally spaced, 10 at each dose, β = 0.91902 known, α = 0 to be
estimated. Lowest dose at P = 0.01, highest dose at P = 0.99. Comparison of minimum logit χ² and
maximum likelihood estimates. Based on 100 stratified random samples.

TABLE 5
ESTIMATE OF LOG RELATIVE POTENCY

  Estimator             Mean      Variance     M.S.E.
  Max. lik.            −.014       .197         .197
  Min. logit χ²        −.015       .183         .183

Four-point parallel assay, 10 at each of the four observations, lower dose P = 0.3, upper dose
P = 0.7. Standard and "unknown" equal potency, relative potency ρ = 1; M = log ρ = 0. Based on 100
stratified random samples.

Finally there should be mentioned the remarkable experiments of
Taylor [27], though they are not a part of the present investigation.
He explored the extreme situation of many doses up to 100, with only
one at each dose, comparing the maximum likelihood estimate with the
minimum Pearson χ² estimate. His results are in agreement with
those of the present study; that is, the mean square error of the maximum
likelihood estimate was found to be larger than that of the minimum
χ² estimate, the contrast being even more pronounced with one
at each dose than in the present series which was standardized with
10 animals at each dose. The mean square error of the maximum
likelihood estimate, using 20 animals, was attained with the minimum
χ² estimate using only 5 animals.

8 See Appendix Note 2 on methods of sampling and calculation.


THE ESTIMATES AND THE INFORMATION LIMIT OF THE VARIANCES 


It was observed in the early phases of the present investigation,
when only some results for the maximum likelihood and minimum
Pearson χ² estimates were available, in the situation of β known, α
to be estimated, that not only was the mean square error of the χ²
estimate less than that of the maximum likelihood estimate, but that
the variance about the mean, as well as the mean square error, was
less than 1/I, where I = E(∂ ln φ/∂α)² is the "amount of information"
of Fisher, φ being the probability of the total sample set. Since the
quantity I, related to the reciprocal of the variance, is sometimes spoken
of as the total amount of available information, the minimum Pearson
χ² estimate seemed to be extracting more than the total amount of
information present in the data! The quantity 1/I is the Cramér-Rao
lower bound for the variance of a regular unbiased estimate, and since
with P of central dose at 0.5, the minimum χ² estimate is unbiased, the
finding that the variance was smaller than 1/I was in apparent
contradiction with this fundamental law also. At first it was speculated
that the explanation of the paradoxical findings lay in the fact that
there is always at least a small fraction of the total population of
samples for which there is no finite solution either by minimum χ² or
maximum likelihood, and that the Cramér-Rao theorem applies only
when all the samples have finite estimates. But a short time later,
calculations were completed which provided the values of the minimum
logit χ² estimates, this estimator as defined giving a finite estimate for
all samples. The variance was found to be even further below the
Cramér-Rao bound than with the minimum χ² estimate!

These findings had the effect of upsetting an apple cart, for they 
seemed to cast doubt on the calculations which had been made and so 
on the entire investigation. The inquiry was in fact interrupted for a 
period of several months while a search was made for the error which 
I presumed had been committed. Although the difficulty was eventual- 
ly resolved, I am reporting the incident of the appearance of the dilem- 
ma, because it is instructive in clarifying some prevalent misunder- 
standings regarding the lower bound of the variance. Also it affords me 
the opportunity of expressing great gratitude to Professor J. Neyman, 
who first put his finger on the exact point which was the source of the 
difficulty. 

The complete formula for the lower bound of the mean square error
of an estimate θ̂ of a parameter θ is given by the following inequality
(assuming certain conditions of regularity):

$$E(\hat{\theta} - \theta)^2 \ge \frac{\left(1 + \dfrac{\partial b}{\partial \theta}\right)^{2}}{I} + b^2, \tag{15}$$

where b = E(θ̂ − θ) is the bias of the estimate θ̂ [22].⁹

9 This inequality is given incompletely in Cramér's [10] text, which was my source for the formula
at the time referred to, and this was in part a reason for the confusion. Professor Joseph Hodges first
pointed out the error to me and gave me the correct formula.
If b = 0, then the mean square error becomes the variance, and it
appears that its lower bound is given by the reciprocal of I, as stated
previously. The question then was how to account for the finding that
with the dosages placed symmetrically about central dose corresponding
to P = 0.5, the χ² estimates were unbiased, while their variances
were less than 1/I. The answer was given to me by Professor Neyman.
He pointed out that even if b = 0, if ∂b/∂α ≠ 0, and has in fact a negative
value, then the right hand side of (15) can be less than 1/I. Such
a situation can obtain if the estimate is unbiased for some but not all
values of α. At the time that this suggestion was made, the values of
b had been computed for a number of values of α,¹⁰ but they had been
recorded in the calculation sheets and had not been subjected to any
special scrutiny. The values of b were now recovered from the archives
and the bias in relation to α was studied by graphic methods. The
bias function is shown in the figure. It is seen that for both the minimum
Pearson χ² estimate and the minimum logit χ² estimate, the bias
is either zero or negative, and its first derivative is indeed everywhere
negative, while the bias for the maximum likelihood estimate is either
zero or positive and its first derivative is positive. When the values of
∂b/∂α, estimated from the bias function graphically, and the computed
value of b, were inserted in (15) for the correct calculation of the lower
bounds, the mean-square errors of the χ² estimates fell properly above
their respective lower bounds. These results are presented numerically
in Table 6.

TABLE 6
ESTIMATE OF α, β KNOWN: LOWER BOUND

  True P at dose    |        | Maximum likelihood | Min. Pearson χ²  |  Min. logit χ²
  Low   Mid   High  |  1/I   |  L.B.    M.S.E.    |  L.B.    M.S.E.  |  L.B.    M.S.E.
  .3    .5    .7    |  .149  |  .158    .158      |  .137    .139    |  .131    .137
  .391  .6    .778  |  .154  |  .164    .164      |  .142    .144    |  .135    .141
  .5    .7    .845  |  .169  |  .185    .187      |  .156    .162    |  .151    .159
  .632  .8    .903  |  .208  |  .239    .249      |  .194    .211    |  .182    .196

β = 0.84730 known, α = 0 to be estimated, conditions as in Table 1. Values of 1/I = 1/Σ n_i P_i Q_i,
and comparison of mean square error with lower bound.

FIGURE. Relation of bias b, and ∂b/∂α, to α, shown for positive values of α. For negative
values of α, the bias reverses sign, while the sign of ∂b/∂α remains the same.

Several points are to be made in connection with the findings shown
in this table. 1. The variance (Table 1) of the maximum likelihood
estimate is larger and that of the χ² estimates is smaller than 1/I. 2. The
lower bounds, the computation of which depends on the previously
calculated values of the bias b, but not on the calculated values of the
variances, are in the same order as found for the variances and the
mean square errors; that is, the bound for the minimum logit χ²
estimate is lowest, that for the minimum Pearson χ² estimate is next
higher, and that of the maximum likelihood estimate is highest. This,
it seems to me, can be considered as some confirmation of the basic
validity of the findings respecting the order of the independently
calculated variances and the mean square errors. 3. In the case of the
maximum likelihood estimate, at dosage arrangements with central
P = 0.5 and P = 0.6, the mean-square errors and lower bound values
are the same, to the precision of 3 significant figures as calculated;
at central P = 0.7, the difference between them is small, and only at
central P = 0.8 is the difference appreciable. Beyond central P = 0.8
(at P = 0.85 for instance) the statistics were studied and the difference
between the mean square error and the lower bound was found to be
even greater than at central P = 0.8. The failure of the maximum
likelihood estimate to attain its lower bound for the mean square error is
perhaps surprising from general considerations of the properties of the
maximum likelihood estimate, but it is in keeping with the specific
findings of the present investigation. It is known that the lower
bound is attained by an estimate only if the estimate is perfectly
correlated with the logarithmic derivative ∂ ln φ/∂θ. As is pointed out
later (see Appendix Note 4), the maximum likelihood estimator yields
infinite estimates which are not in one-to-one relation with the
logarithmic derivative, and therefore it could be anticipated that the maximum
likelihood estimate cannot everywhere attain the lower bound value
for its mean square error.

10 A change of disposition of the dosages with samples considered as from a given logistic function
such as one defined by α = 0, β = 0.84730, is equivalent to samples drawn with the centrally disposed
dosages unchanged but from functions with different values of α = logit P_c, where P_c is the value of P
corresponding to the central dose.
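
The 1/I column of Table 6 follows directly from the caption formula 1/I = 1/Σ n_i P_i Q_i and can be recomputed as a check. In the sketch below the function name and the coding of the doses as −1, 0, 1 are assumptions made for illustration; the values of β and n are those stated for the main experiments.

```python
import math

def information_bound(central_p, beta=0.84730, n=10, doses=(-1, 0, 1)):
    """Reciprocal of Fisher's information for alpha, 1 / sum(n * P_i * Q_i)."""
    alpha = math.log(central_p / (1 - central_p))   # logit of the central P
    info = 0.0
    for x in doses:
        p = 1 / (1 + math.exp(-(alpha + beta * x)))
        info += n * p * (1 - p)
    return 1 / info

bounds = [round(information_bound(p), 3) for p in (0.5, 0.6, 0.7, 0.8)]
print(bounds)   # [0.149, 0.154, 0.169, 0.208]
```

The bounds actually compared with the mean square errors in Table 6 also require b and ∂b/∂α in (15), which were read from the bias curves and are not recomputed here.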

The logistic function with binomial variation of the dependent variate
has sufficient statistics, which are Σ n_i p_i and Σ n_i x_i p_i, for the
estimates of α and β respectively.¹¹ Since sufficient statistics exist here,
the possibility of improvement of the estimates by way of the
Rao-Blackwell [24] [6] theorem presents itself. According to this remarkable
theorem, if there is a sufficient statistic u for a parameter θ, and an
estimate t which is not a one-to-one function of u, the conditional
expectation of t given u is another estimate which has the same
expectation as t and smaller variance than t. The maximum likelihood
estimate, when it is finite, is necessarily a one-to-one function of the
sufficient statistics Σ n_i p_i, Σ n_i x_i p_i, as can be seen directly from the
equations of estimate, and the conditional expectation of the estimates
given the sufficient statistic is therefore identical with the estimate
itself; the maximum likelihood estimate is therefore not subject to
improvement by means of this theorem. However, each of the minimum
χ² estimates is subject to improvement. For the estimate of α, given β
as known, the Rao-Blackwellized χ² estimates were computed directly,
and their mean square errors were calculated. These are shown in
Table 7. Here something extraordinary is evident: within the precision
of the arithmetic calculations, the mean square errors of the minimum
logit χ² estimates are equal to the previously and independently computed
lower bounds, for all values of central P. Thus this estimate attains what
Rao [23] calls the information limit of the variance. I believe that the
Rao-Blackwellized minimum logit χ² estimate is the first estimate
achieved that has full efficiency in finite samples and has variance less
than 1/I. It is, of course, also sufficient.
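
The Rao-Blackwell step can be illustrated by a short enumeration. The sketch below is not the original punched-card computation; it assumes 3 doses coded x = -1, 0, 1 with 10 animals at each, α = 0, β = 0.84730 per unit of coded dose, and working values 1/2n and 1 - 1/2n for observations of zero and unity. The function names are hypothetical. With β known and equal n, the sufficient statistic for α is the total number responding, so the improved estimate is the average of the original estimates over all samples sharing that total.

```python
import math
from itertools import product


def binom_pmf(k, n, p):
    """Binomial probability of k responders out of n."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def min_logit_chi2_alpha(counts, n, xs, beta):
    """Minimum logit chi-square estimate of alpha with beta known.

    Assumption: an observed proportion of 0 or 1 is replaced by the working
    value 1/(2n) or 1 - 1/(2n), so every sample gives a finite estimate.
    """
    num = den = 0.0
    for r, x in zip(counts, xs):
        p = min(max(r / n, 1 / (2 * n)), 1 - 1 / (2 * n))
        w = n * p * (1 - p)  # weight n*p*q of the logit chi-square
        num += w * (math.log(p / (1 - p)) - beta * x)
        den += w
    return num / den


def rao_blackwell_mse(n, xs, alpha, beta):
    """Return (MSE before, MSE after) Rao-Blackwellization by full enumeration."""
    true_p = [1 / (1 + math.exp(-(alpha + beta * x))) for x in xs]
    samples = []
    for counts in product(range(n + 1), repeat=len(xs)):
        prob = math.prod(binom_pmf(r, n, P) for r, P in zip(counts, true_p))
        est = min_logit_chi2_alpha(counts, n, xs, beta)
        samples.append((sum(counts), prob, est))

    # Conditional expectation of the estimate given the sufficient statistic
    # (the total number of responders, since n is equal at all doses).
    weight, weighted_sum = {}, {}
    for total, prob, est in samples:
        weight[total] = weight.get(total, 0.0) + prob
        weighted_sum[total] = weighted_sum.get(total, 0.0) + prob * est
    conditional = {t: weighted_sum[t] / weight[t] for t in weight}

    before = sum(prob * (est - alpha) ** 2 for _, prob, est in samples)
    after = sum(prob * (conditional[t] - alpha) ** 2 for t, prob, _ in samples)
    return before, after


before, after = rao_blackwell_mse(n=10, xs=(-1, 0, 1), alpha=0.0, beta=0.84730)
print(f"MSE before R-B: {before:.3f}, after R-B: {after:.3f}")
```

Because the conditional distribution of the sample given the total does not depend on α, the averaging uses no knowledge of the parameter being estimated; the true α enters only in scoring the mean square error.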





11 See Appendix 4 on the sufficiency of the estimates.
APPENDIX NOTE 1. THE DEFINITION OF THE MINIMUM χ² ESTIMATE


Some writers [12] [14] [18] have compared the minimum x? estimate 
unfavorably with the maximum likelihood estimate, and their exam- 
ples include the estimate of the binomial parameter P. Yet, as de- 
veloped in the present paper, these two estimates are identical, so 
that one can hardly be inferior to the other. An examination of the 
works referred to discloses that there is ambiguity in respect to what 
is meant by the minimum χ² estimate. The contradiction brings to the


TABLE 7
ESTIMATE OF α, β KNOWN: RAO-BLACKWELLIZATION

       Minimum Pearson χ²               Minimum logit χ²
   Lower     Before    After        Lower     Before    After
   bound     R-B       R-B          bound     R-B       R-B

   .137      .139      .137         .131      .137      .131
   .142      .144      .142         .135      .142      .135
   .156      .162      .158         .151      .159      .151
   .194      .211      .200         .182      .197      .183
   .225      .255      .239         .206      .224      .206
   .271      .305      .282         .239      .261      .239

β = 0.84730 known, α = 0 to be estimated, conditions as in Table 1. Comparison, for the χ²
estimates, of the lower bound values with the mean square errors, before and after Rao-Blackwellization
of the estimates. In this table the statistics for the minimum logit χ² estimate do not exclude the two
samples omitted in Table 1.


fore a point which I [5] discussed several years ago in connection with
the χ² test. Very frequently, data as they present themselves permit
different ways of properly calculating the χ², and this presents difficulties
for the interpretation of the results when the χ² test is done with
any particular arrangement. Now we encounter the difficulty in relation
to estimation.

Consider the situation of a die with probability of “success” P,
thrown a very large number of times N, or equivalently N identical
dice each with characteristic P thrown once, or, more generally still,
that there have been N throws accomplished by throwing N/r sets
with r in each set. Perhaps, as in Weldon’s [21] famous case, they have
been thrown 12 at a time. If they are fair, the probability P of the
appearance of a 5 or 6 is 1/3, and if the “null hypothesis is true,” the
expected numbers for 0, 1, 2, ..., 12 successes in the throws are given
by the 13 successive terms of (N/12)(2/3 + 1/3)^12. To test this we can
set the corresponding observed frequencies against the expectations
and perform the χ² test as for 12 D.F. (We suppose N large enough so
that we can use the asymptotic χ² distribution.) Now someone might
not like 12 and propose that the observations should be obtained by
taking up the N/10 sets of 10 successive dice each, for a test with 10
D.F., or generally, the N/r sets for a test with r degrees of freedom
against the expectations given by

$$\frac{N}{r}\left(\frac{2}{3} + \frac{1}{3}\right)^{r}.$$

Which of these should be used for the test?¹² Or, to consider the estimation
problem, suppose we do not wish to test whether P=1/3 but
instead we do not know P and wish to estimate its value as p by minimizing
χ²; the χ² of which arrangement should we minimize?
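
That the same throws give different χ² values under different groupings can be seen in a small sketch. The data below are artificial (a fixed repeating pattern, not Weldon's), and the function names are hypothetical; the point is only that one series of throws yields one χ² for each choice of r.

```python
import math


def expected_counts(n_throws, r, p):
    """Expected numbers of sets showing 0..r successes: (N/r) times the binomial terms."""
    n_sets = n_throws / r
    return [n_sets * math.comb(r, k) * p**k * (1 - p) ** (r - k) for k in range(r + 1)]


def chi_square_for_grouping(throws, r, p):
    """Pearson chi-square of a 0/1 series taken up in consecutive sets of r throws."""
    if len(throws) % r:
        raise ValueError("number of throws must be a multiple of r")
    observed = [0] * (r + 1)
    for start in range(0, len(throws), r):
        observed[sum(throws[start:start + r])] += 1
    expected = expected_counts(len(throws), r, p)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Artificial series: exactly one success in every three throws (N = 120, S = 40).
throws = [1, 0, 0] * 40
for r in (1, 6, 12):
    print(r, round(chi_square_for_grouping(throws, r, 1 / 3), 3))
```

With r = 1 the statistic depends on the throws only through the total number of successes, which here agrees exactly with P = 1/3; with r = 12 every set shows exactly four successes, a regularity the larger grouping detects and the r = 1 arrangement cannot.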

We can note that if we take r=1, we have the four-fold table χ²,
which in fact is the one we have used as the minimum χ² estimate;
this gives p = S/N, where S is the total number of successes, and this
estimate is the same as the maximum likelihood estimate. Now it is
known [9] that for a regular unbiased estimate of P, the sampling variance
of this estimate is the smallest possible, and that should be a
sufficient reason for choosing this χ² to minimize, as the definition of
the minimum χ² estimate, for it is self-defeating to stigmatize the minimum
χ² estimate as inferior, by reference to a χ² estimate that is less
than optimum.

We can similarly analyze the minimization of χ² for the estimate of
the parameter λ of a Poisson distribution. If we have observed frequencies
x_1, ..., x_i, ..., x_n, considered hypothetically as a sample of
a Poisson distribution with parameter λ,

$$p(x_i) = \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}, \tag{16}$$

the expected numbers for x having the values 0, 1, 2, ..., i, ..., r are
given by

$$F_i = n\,p(i). \tag{17}$$

We may evaluate a χ² for a test with r degrees of freedom by enumerating
the corresponding observed frequencies f_1, f_2, ..., f_i, ..., f_r:

$$\chi^2_{(r)} = \sum^{i=r} \frac{(f_i - F_i)^2}{F_i}. \tag{18}$$





12 There are many more arrangements possible. For instance, we may add the χ²’s of the N/10 sets
of 10 for a test based on N degrees of freedom.
The χ² of (18) may be modified, if we do not know λ, to evaluate a χ² as
for (r−1) degrees of freedom by substituting F′ for F, with F′ evaluated
from (16) using the mean x̄ = Σx/n instead of λ:

$$\chi^2_{(r-1)} = \sum^{i=r} \frac{(f_i - F'_i)^2}{F'_i}. \tag{18'}$$


If, instead of considering the frequencies f of the frequencies x, we
consider the x directly as observed frequencies, and since the expectation
for each x is λ, we can evaluate a χ² as for n degrees of freedom:

$$\chi^2_{(n)} = \sum_{i=1}^{i=n} \frac{(x_i - \lambda)^2}{\lambda}. \tag{19}$$


The χ² of (19) may be similarly modified, if we do not know λ, by substituting
x̄ for λ, and evaluating a χ² as for (n−1) degrees of freedom:

$$\chi^2_{(n-1)} = \sum_{i=1}^{i=n} \frac{(x_i - \bar{x})^2}{\bar{x}}. \tag{19'}$$ ¹³


Consider again the χ²_(n) of (19). If the observed frequencies x_1 and
x_2 are added together and set against the expectation 2λ, this taken
together with the remaining (n−2) values of x gives a χ² as for (n−1)
degrees of freedom. More generally, if r values of x are combined to
constitute a group for which the expectation is rλ, this taken together
with the remaining (n−r) values of x, for each of which the expectation
is λ, gives a χ² for (n−r+1) degrees of freedom. If we let r=n, we have
for the expected total nλ, and for the observed Σx = n x̄, and a χ² as
for 1 degree of freedom:

$$\chi^2_{(1)} = \frac{n(\bar{x} - \lambda)^2}{\lambda}. \tag{20}$$

If the χ² of (20) is minimized for the estimate of λ, we obtain

$$\hat{\lambda} = \bar{x}. \tag{21}$$

The estimate (21) is identical with the maximum likelihood estimate
of the Poisson parameter, and is known to be “best.”

We see then that in the cases considered there are many statistics of
the observations distributed as χ² (in “large samples”), which yield
minimum χ² estimates, but only one which yields the “best” estimate.
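
A small numerical sketch makes the contrast concrete for the Poisson case. It compares the χ² with n degrees of freedom, Σ(x_i − λ)²/λ, against the χ² with 1 degree of freedom, n(x̄ − λ)²/λ, each minimized over λ. The data are invented, and the golden-section search is merely a convenient stand-in for any one-dimensional minimizer; both functions are convex in λ for λ > 0.

```python
import math


def chi2_n(xs, lam):
    """Chi-square as for n degrees of freedom: each x set against expectation lam."""
    return sum((x - lam) ** 2 / lam for x in xs)


def chi2_1(xs, lam):
    """Chi-square as for 1 degree of freedom: the total set against n*lam."""
    n = len(xs)
    xbar = sum(xs) / n
    return n * (xbar - lam) ** 2 / lam


def golden_section_minimize(f, lo, hi, tol=1e-9):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2


xs = [2, 0, 3, 1, 4, 2]  # invented counts
lam_n = golden_section_minimize(lambda lam: chi2_n(xs, lam), 0.01, 20.0)
lam_1 = golden_section_minimize(lambda lam: chi2_1(xs, lam), 0.01, 20.0)
print(f"minimizing the n-D.F. chi-square: {lam_n:.4f}")
print(f"minimizing the 1-D.F. chi-square: {lam_1:.4f}  (sample mean {sum(xs) / len(xs):.4f})")
```

Setting the derivative of Σ(x_i − λ)²/λ to zero gives λ² = Σx_i²/n, so the n-D.F. arrangement yields the root mean square of the observations rather than their mean; only the 1-D.F. arrangement reproduces x̄.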





13 The χ² of (19′) is sometimes referred to as the “index of dispersion” of the Poisson distribution.
I suggest that unless otherwise stated we should mean by the “minimum
χ² estimate” the “best” minimum χ² estimate, that is, one obtained
by minimizing that χ² which yields the “best” estimate. By
“best” estimate I should mean the one with smallest mean square
error, but some other definition can of course be taken.

In the case of estimating the binomial or Poisson parameter, we
recognized which χ² is “best,” because we knew independently which
estimate is best. Can we state some rule by which the “best” χ² would
be recognized, without this knowledge? Perhaps, but as far as I know
this question has not been investigated mathematically, and no rule
can be presently given. It is interesting, however, to speculate! Suppose
we consider the use of the χ²’s which we can calculate from the
observations, for tests of significance instead of for estimation: which
χ² should then be used? Here I think we must invoke the Neyman-Pearson
theory of tests of significance and say, “We should choose the
χ² which is the most powerful test for a specified alternative, and it
would depend on what that alternative is.”

If we recall again Weldon’s series of dice thrown in sets of 12, we
may suppose a situation in which we were quite certain that the dice
were all “fair,” that is, if they were thrown at random in a long series
of throws, the limiting value of the relative frequency of 5 or 6 would
be 1/3, but where we were not sure that they were in fact thrown
fairly, that is, we thought they might have been manipulated to give
favorable combinations. It seems then, on intuition, that it would be
well to consider the χ² calculated as for r=12, because since the dice
were thrown in groups of 12, the manipulations would have been performed
in this grouping, and the discrepancy from the expected χ²
would be likely to be apparent for this arrangement; if the r was taken
as, say, r=6, this discrepancy might be “averaged out.”

But suppose that, on the contrary, we knew that each had the same
probability P of showing 5 or 6, and that the throws were at “random,”
but we were uncertain as to whether P=1/3. Then I think it can be
shown that the most powerful χ² test is with the use of the χ² with r=1.
Supposing that this is correct, then we have an identity between the
χ² to minimize for the estimate of P and the χ² to use for a test in which
the only admissible alternative is P≠1/3. This is aesthetically satisfying
in that the χ² which is best for testing for P is the same as that
used for estimating P. It suggests itself that this may be the general
rule, that is, that the “best” χ² for estimate of a parameter θ of which
the probability of the observations is a function, is the best χ² for testing
against an alternative to the tested hypothesis, different only in
respect of the value of the parameter θ. However, this is here quite
speculative, and requires for proper elucidation competent mathematical
investigation.


APPENDIX NOTE 2. METHODS OF SAMPLING AND CALCULATION


For the main experiment, dealing with the logistic function α=0,
β=0.84730, at various arrangements of equally spaced doses, statistics
were calculated as for the total population of samples, except in the
case of the minimum Pearson χ² estimate of the two parameters simultaneously,
for which a stratified random sample of 1,000 was used with
each dosage arrangement. For each of the rest of the experiments, a
stratified random sample of 100 was used.

The method of constructing the stratified random samples can be 
described in terms of the experiment with 3 equally spaced doses cor- 
responding respectively to true P equal to 0.1, 0.5, 0.9. We wish to have 
100 samples, each representing an exposure of 10 animals at each of 
the 3 doses. The 100 observed values of p for the first dose represent 
samples from a binomial distribution with true P=0.1. With 10 animals 
exposed there are 11 possible values of p. The expected number of 
samples for each of these, from p=0 to p=1, for a total number of 
100 samples was calculated from the binomial distribution 100(0.9
+0.1)^10. For each p a number of cards were counted out equal to the
nearest integer with this expected number, and on each card was
punched the value of p. For the fractional units of expected numbers,
cards were assigned to p’s at random, to make up the total of 100.
When this part of the procedure was finished, we had a sample of 100
very closely representative of the binomial (0.1+0.9)^10. These were
now randomized by punching a random number of 3 or more digits
on each card and sorting the 100 cards on this random number.¹⁴ In the
same way a stratified random set of p’s was obtained for P=0.5, and 
for P=0.9. The first p of each of the 3 randomized sets constituted the 
first of the 100 samples to be used for the experiment, the second p 
of each set the second sample, and so forth. The resulting set of 100 
samples is designed to give a better estimate of the desired statistics 
than would be obtained from a random sample not stratified according 
to the expected p’s. 
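
The card procedure can be mimicked in a few lines. This is one reading of the description, not a transcription of it: the whole part of each expected number is taken as the count of cards, and the cards left over are given to distinct values of p at random with chances proportional to the fractional parts (that weighting is an assumption, as are the function names and the seed).

```python
import math
import random


def stratified_proportions(n_animals, true_p, n_samples, rng):
    """One randomized deck of n_samples observed proportions for a single dose."""
    expected = [
        n_samples * math.comb(n_animals, k) * true_p**k * (1 - true_p) ** (n_animals - k)
        for k in range(n_animals + 1)
    ]
    counts = [math.floor(e) for e in expected]          # whole cards per value of p
    fractions = [e - c for e, c in zip(expected, counts)]
    candidates = list(range(n_animals + 1))
    for _ in range(n_samples - sum(counts)):            # cards for the fractional units
        weights = [fractions[k] for k in candidates]
        pick = rng.choices(candidates, weights=weights)[0]
        counts[pick] += 1
        candidates.remove(pick)                         # at most one extra card per p
    cards = [k / n_animals for k, c in enumerate(counts) for _ in range(c)]
    rng.shuffle(cards)                                  # stands in for sorting on a random number
    return cards


def stratified_experiment(true_ps, n_animals=10, n_samples=100, seed=0):
    """Pair the i-th card of each dose's deck to form the i-th sample."""
    rng = random.Random(seed)
    decks = [stratified_proportions(n_animals, P, n_samples, rng) for P in true_ps]
    return list(zip(*decks))


samples = stratified_experiment((0.1, 0.5, 0.9))
print(samples[:3])
```

Each dose's deck then reproduces the binomial frequencies to within one card per value of p, which is what makes the 100 samples more representative than an unrestricted random draw.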

Several comparisons were made of the results obtained with samples 
constructed in this way and the correct statistics as calculated from 
the total population, and it was found that a stratified random sample 





14 Ordering on a random number randomizes the original ordered series of cards.
as large as 100 gave very good estimates. As examples, there are shown
in Tables 8 and 9 comparisons of statistics as obtained from a stratified
random sample of 100 and as obtained from calculation for the entire
sampling population where these were available, for estimation of α
alone and for estimation of both α and β, with central P=0.5 and with
central P=0.7. It is seen that very satisfactory estimates are obtained
with a stratified random sample of as many as 100.


TABLE 8
ESTIMATE OF α, β KNOWN: COMPARISON STRATIFIED
SAMPLE AND TOTAL POPULATION

                         Variance             Ratio max. lik./E
Estimator E          Total     100            Total     100
                     pop.      samples        pop.      samples

Central P = 0.5
  Max. lik.          .158      .156
  Min. χ²            .139      .136
  Min. logit χ²      .137      .130

Central P = 0.7
  Max. lik.          .186      .192
  Min. χ²            .161      .161
  Min. logit χ²      .156      .155

Comparison of variances as for total population and as obtained from 100 stratified random
samples. β = 0.84730 known, α = 0 to be estimated; 3 equally spaced doses, 10 at each dose.


For the calculation of the statistics of the total population, a minimum
of 10 significant figures was used. With the estimates themselves
written to 5 decimal places correct within ±5 in the last figure, the
final statistics were written to 3 decimal places and are believed to be
generally correct to within ±1 in the last figure. The estimates obtained
with the stratified random samples appear to be generally reliable
within ±5 of the second significant figure.

Punch card machines, including sorter, tabulator, summary punch
and collator, but not automatic multiplying punch, were available
and were used, as these were found convenient. For the calculation of
the estimates, electric Monroe calculators were used. For obtaining the
p corresponding to given logit, the exponential tables of reference [25]
were used, and for obtaining the logit corresponding to given p the
tables of the natural logarithms of reference [26] were used. In the
case of the maximum likelihood and minimum Pearson χ² estimates,
iterations were continued till the difference between two successive
iterations was less than 0.00005, and the estimates, being written to
5 decimal places, were therefore correct within ±5 of the last figure.

TABLE 9
ESTIMATE OF α AND β: COMPARISON STRATIFIED SAMPLE
AND TOTAL POPULATION

                             Estimate of α                         Estimate of β
                      Variance        Ratio max. lik./E     Variance        Ratio max. lik./E
Estimator E         Total   100       Total   100          Total   100       Total   100
                    pop.    samples   pop.    samples      pop.    samples   pop.    samples

Central P = 0.5
  Max. lik.                 .183      1.00    1.00
  Min. logit χ²             .150      1.21    1.22

Central P = 0.7
  Max. lik.         .430    .406      1.00    1.00         .393    .350      1.
  Min. logit χ²     .393    .375      1.09    1.08         .274    .257      1.

Comparison of variances as for total population and as obtained from 100 stratified random
samples. α = 0, β = 0.84730, both to be estimated; 3 equally spaced doses, 10 at each dose.

Each estimate after calculation was checked by insertion into the conditional
equations of estimation. For calculation of the statistics a
minimum of 10 significant figures was retained. The estimates of the 
two main experiments (α to be estimated, and α together with β to be
estimated) for each of the 1,331 possible samples and their squares 
were punched on cards, as well as the probability of the sample, so 
that sums and sums of squares were obtainable by machine tabulation. 
A certain amount of over-all checking was done by use of the automat- 
ic multiplying punch through contract with an International Business 
Machines Corporation service bureau. All work was checked, in some 
cases more than once, by direct and indirect checks. Still there is no 
guarantee that the results are entirely free from arithmetic error,
though I am fairly certain there is no serious mistake in the final results 
arising from this source. 

The estimates themselves were obtained as follows: For the minimum
Pearson χ² estimate of the two parameters simultaneously, a routine
iterative procedure in terms of logits was employed, using weights and
working values as given in reference [2], except that here, as for all
other estimates requiring iterative procedures, what was solved for in
each iteration was δa and δb, the corrections to provisional values,
rather than a and b. For the estimate of the one parameter α with β
known, the same method was used, with of course only the one equation
instead of two to be solved. When the main work was completed
I learned from Dr. William Taylor that he had developed an explicit
solution for this case! This is given by

$$a = \frac{1}{2}\ln\frac{\sum n_i p_i^2 e^{-\beta x_i}}{\sum n_i q_i^2 e^{\beta x_i}}.$$

Had I known of Taylor’s solution at the start of the experiment, the
statistics of the minimum χ² estimate of α with β known could have
been obtained with less than half the time and effort required by the
iterative procedure. Taylor’s solution was utilized for a certain amount
of checking.
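
An explicit solution of this kind can be checked numerically. Writing u = e^a, the Pearson χ² with β known reduces to Σ n_i (p_i − q_i u e^{βx_i})² / (u e^{βx_i}), and setting its derivative in u to zero gives u² = Σ n_i p_i² e^{−βx_i} / Σ n_i q_i² e^{βx_i}. The sketch below, with invented observations and hypothetical function names, compares that closed form against direct evaluation of the χ² on either side of it; it is offered as a check on the algebra, not as the procedure actually used.

```python
import math


def pearson_chi2(a, beta, xs, ps, n):
    """Pearson chi-square of observed proportions ps against the logistic with (a, beta)."""
    total = 0.0
    for x, p in zip(xs, ps):
        P = 1 / (1 + math.exp(-(a + beta * x)))
        total += n * (p - P) ** 2 / (P * (1 - P))
    return total


def explicit_min_pearson_alpha(beta, xs, ps):
    """Closed-form minimizer of the Pearson chi-square in a, beta known, equal n."""
    num = sum(p**2 * math.exp(-beta * x) for x, p in zip(xs, ps))
    den = sum((1 - p) ** 2 * math.exp(beta * x) for x, p in zip(xs, ps))
    return 0.5 * math.log(num / den)


xs, ps, beta, n = (-1, 0, 1), (0.2, 0.6, 0.7), 0.84730, 10  # invented sample
a_hat = explicit_min_pearson_alpha(beta, xs, ps)
for a in (a_hat - 0.01, a_hat, a_hat + 0.01):
    print(f"a = {a:+.5f}  chi-square = {pearson_chi2(a, beta, xs, ps, n):.6f}")
```

The closed form needs no provisional value and no iteration, which is the saving in labor referred to in the text.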

For the maximum likelihood estimates the following general scheme
was used, which is appreciably easier than that usually advanced. The
equations of estimate are, for the present situation with equal n at all
doses,

$$\sum (p - \hat{p}) = 0, \tag{22}$$
$$\sum x(p - \hat{p}) = 0. \tag{23}$$

If we have p̂_0, a provisional value of p̂ corresponding to provisional
values a_0 and b_0, and if l̂_0 is the corresponding provisional logit, that
is, if

$$\hat{l}_0 = a_0 + b_0 x,$$

we can write for p̂ − p̂_0, approximately,

$$\hat{p} - \hat{p}_0 \approx (\hat{l} - \hat{l}_0)\,\hat{p}_0\hat{q}_0, \tag{24}$$

and from (22) (23) the approximate equations of estimate are

$$\sum \left[(p - \hat{p}_0) - (\hat{l} - \hat{l}_0)\,\hat{p}_0\hat{q}_0\right] = 0, \tag{25}$$
$$\sum x\left[(p - \hat{p}_0) - (\hat{l} - \hat{l}_0)\,\hat{p}_0\hat{q}_0\right] = 0. \tag{26}$$
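Since (l̂ − l̂_0) = δa + δb x, where δa and δb are the corrections to the
provisional estimates a_0 and b_0, the corrections are given by:

$$\delta b = \frac{\sum px - \sum \hat{p}_0 x - \dfrac{\sum \hat{p}_0\hat{q}_0 x\,\left(\sum p - \sum \hat{p}_0\right)}{\sum \hat{p}_0\hat{q}_0}}{\sum \hat{p}_0\hat{q}_0 x^2 - \dfrac{\left(\sum \hat{p}_0\hat{q}_0 x\right)^2}{\sum \hat{p}_0\hat{q}_0}}, \tag{27}$$

$$\delta a = \frac{\sum p - \sum \hat{p}_0 - \delta b \sum \hat{p}_0\hat{q}_0 x}{\sum \hat{p}_0\hat{q}_0}. \tag{28}$$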


Again it was not till most of the relevant work was done, that the 
explicit approximate solution of the maximum likelihood estimate for 
3 equally spaced doses, equal n at each dose, advanced by Wilson and 
Worcester [30], was examined. Although these authors put down their 
published estimates to only 3 significant figures, it was found by trial 
that their solution usually provides estimates correct to 4 or 5 signifi- 
cant figures. Had this been realized early, a great deal of labor could 
have been saved in the calculation of the maximum likelihood esti- 
mate for the 3 dose experiments. 

What has been described in previous paragraphs is the procedure
used for obtaining the maximum likelihood estimates in all cases except
the experiment with 11 doses, β known, α to be estimated. For a
large number of doses, equal n at each dose, Taylor has developed a
method of obtaining a very good approximation of the maximum
likelihood estimate explicitly without iterative procedures. For estimation
of α with β known and with equal numbers at all doses, the
maximum likelihood estimate is obtained by equating Σp̂ to Σp.
For a large number S of equally spaced doses x, coded successively
0, 1, 2, ..., (S−1), Σp̂ is given with close approximation by the
definite integral

$$\sum \hat{p} \approx \int_{-1/2}^{S-1/2} \frac{1}{1 + e^{-(a+\beta x)}}\,dx, \tag{29}$$

and setting this equal to Σp gives

$$a = \ln\frac{e^{\beta/2 - \beta S} - e^{\beta/2 + \beta(\sum p - S)}}{e^{\beta(\sum p - S)} - 1}. \tag{30}$$





For the 11 dose experiment the method of Taylor gave the maximum 
likelihood estimate of a correct to 4 significant figures in almost half the 
samples and correct to at least 3 significant figures in practically all 
samples. 
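
The quality of an integral approximation of this kind is easy to examine. The sketch below assumes doses coded 0, 1, ..., S−1 and the integral of the logistic taken from −1/2 to S−1/2; equating that integral to Σp and solving gives e^a = e^{β/2}(e^{βΣp} − 1)/(e^{βS} − e^{βΣp}). The value β = 0.5 per coded dose and the values of Σp are invented, and the exact estimate is found by bisection on Σp̂ = Σp.

```python
import math


def taylor_alpha(beta, n_doses, sum_p):
    """Explicit approximation: a = beta/2 + ln(e^(beta*T) - 1) - ln(e^(beta*S) - e^(beta*T))."""
    return (
        beta / 2
        + math.log(math.expm1(beta * sum_p))
        - math.log(math.exp(beta * n_doses) - math.exp(beta * sum_p))
    )


def exact_alpha(beta, n_doses, sum_p, lo=-60.0, hi=60.0):
    """Maximum likelihood a with beta known: solve sum of p-hat = sum of p by bisection."""
    def excess(a):
        return sum(1 / (1 + math.exp(-(a + beta * x))) for x in range(n_doses)) - sum_p

    for _ in range(200):
        mid = (lo + hi) / 2
        if excess(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for total in (3.0, 4.0, 5.5, 7.5):
    approx, exact = taylor_alpha(0.5, 11, total), exact_alpha(0.5, 11, total)
    print(f"sum p = {total}: explicit {approx:.5f}, iterative {exact:.5f}")
```

When Σp is exactly half the number of doses the two agree by symmetry; elsewhere the difference reflects only the error of replacing the sum by the integral.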
It is pertinent to note that the methods outlined in previous paragraphs
for calculating the minimum χ² and maximum likelihood estimates
of the logistic function are considerably easier than are available
for the same estimates of the probit equation. There is no explicit
solution of the minimum Pearson χ² estimate of the mean of the probit
equation, with the standard deviation known, corresponding to Taylor’s
solution for the logistic function when α is to be estimated with β
known, nor can Taylor’s explicit approximate solution of the maximum
likelihood estimate of the logistic function be applied to the probit
equation, nor can Wilson and Worcester’s. Even the iterative procedure
used for the maximum likelihood estimate of the logistic function in
the general case as outlined in previous paragraphs is appreciably
easier than any presently available for the probit equation.

The advantage for calculation with the logistic function can be
traced to the fact that with the iterative procedure for obtaining the
maximum likelihood estimate, there is required implicitly the calculation
of the quantity Σ (z_0/p̂_0 q̂_0)(p − p̂_0), where p̂_0, q̂_0, z_0 represent
provisional values which are different from iteration to iteration. With the
probit equation this involves for each iteration and for each observation,
computation of the quantity (z_0/p̂_0 q̂_0)p, but since with the logit
equation z_0 = p̂_0 q̂_0, the coefficient z_0/p̂_0 q̂_0 is unity and one needs to
calculate only Σp, and this only once. The same point applies to the
quantity (z_0/p̂_0 q̂_0)xp which is necessary in the estimation of β, replaced
in the case of the logistic function by the single computation
of Σxp. A secondary advantage arises from the fact that, since the
Taylor’s series approximation of (p − p̂) in the terms of the linear transform
is better in terms of the logit transform than in terms of the probit
transform [3], the iterations with the logistic function converge somewhat
more rapidly.


APPENDIX NOTE 3. THE CASE OF ZERO SURVIVORS

The equations which give the minimum logit χ² estimate are

$$\sum n_i p_i q_i \hat{l}_i = \sum n_i p_i q_i l_i, \tag{31}$$
$$\sum n_i p_i q_i x_i \hat{l}_i = \sum n_i p_i q_i x_i l_i, \tag{32}$$

where n_i, p_i = 1 − q_i, and l_i are the number exposed, the observed relative
frequency of response, and its logit ln p_i/q_i respectively, at dose
x_i, and l̂_i is the estimated logit given by l̂_i = a + b x_i.


For an observed p_i which is either zero or 100 per cent, the value of
l_i is infinite, the value of p_i q_i is zero, and the value of p_i q_i l_i approaches
zero as p_i approaches either extreme value. If with p observed as zero
or 100 per cent, we give p_i q_i l_i its limiting value of zero, then in the solution
of (31) (32) these observations are effectively eliminated; whether
an observation had been made and it turned out to be zero or whether
it turned out to be 100 per cent, or whether no observation had been
made at all, the estimate derived is the same. On the basis of “common
sense,” as well as on the basis of mathematical statistical theory, an
observation of zero or 100 per cent made at some particular x_i furnishes
“information” as respects the estimate, and omitting it wastes
such information, if indeed it does not warp the estimate. It may be
argued that it is not the experimenter who omits the observation, but
the mathematics that does it, and that this endows the omission with
statistical validity. Such a view perverts the purpose of statistical
estimation, which is not to fulfill a preconceived general principle, but
to produce an estimate. If there are specific situations in which an estimator
behaves badly, it is not only permissible, it seems to me that it is
required by sound statistical theory that the estimator be modified or
abandoned in that situation, and a supplementary rule of estimation be
applied which circumvents the difficulty and yields an estimate which
in actual application will be acceptable. One such rule for the situation
here considered is to use for an observation of zero a replacement
working value of 1/2n, and for an observation of unity a working value
(1−1/2n). The reasonableness of such a rule may be set forth as follows:

Consider an experiment in which the samples are drawn at 3 equally
spaced doses x_1, x_2, x_3, corresponding to which the observations are
p_1=0.3, p_2=0.5, p_3=0.7; since the logits of these fall on a straight line,
we expect that the estimates of the parameters will correspond to this
line, and this should be the case with any consistent estimator. Consider
now another sample with p_2=0.5 as for the first sample but with
p_1 and p_3 decreased and increased respectively by an amount Δp, for
instance with Δp=1/10, p_1=0.2, p_3=0.8; from this sample we shall
obtain the same estimate of α as with the first sample, but b will be
larger. If we change the values of p_1 and p_3 again in the same direction
as before, b will increase again, and in general b will increase monotonically
with Δp. The value of p cannot change continuously, but for
any specified n only in steps of 1/n; the smallest possible value for p
greater than zero is 1/n and the largest possible value less than unity
is (1−1/n). With these values, the estimate of β will be b = −logit
1/n = logit (1−1/n). At the next possible values of p_1 and p_3 with addition
of Δp, the estimate of β is b = ∞. This value of β is meaningless
practically and in fact is in contradiction of the fundamental assumption
that the true P’s follow a logistic function, for such a value of β
implies a true P of zero and unity at finite doses, whereas the logistic
function is asymptotic to these values of P.

The monotonic increase of b with Δp, taken together with the exclusion
of an estimate of β corresponding to p_1=0, p_3=1.0, implies an
estimate of β that corresponds to an observation p_1, 0<p_1<1/n, and
of p_3, 1>p_3>(1−1/n). It seems reasonable to set p_1 halfway between
0 and 1/n, that is, at 1/2n, and p_3 halfway between (1−1/n) and 1, that
is, at (1−1/2n). Any other working values within these limits would
meet the requirements of the monotonicity of increase of the estimate
of β with Δp and the exclusion of an infinite estimate for β; that is, we
could take as the working value to replace an observed p of zero a
value p′=1/kn, where k>1 is some other value than 2, but k=2
recommends itself on the basis of simplicity if nothing else.
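
The effect of the 1/2n rule on the minimum logit χ² estimate can be seen in a brief sketch. The estimate is a weighted least-squares line through the observed logits with weights n p q; the only addition is the replacement of an observed zero by 1/2n and of an observed unity by 1 − 1/2n. The doses are coded −1, 0, 1 and the function names are hypothetical.

```python
import math


def working_proportion(r, n):
    """Observed proportion r/n, with 0 and 1 replaced by 1/(2n) and 1 - 1/(2n)."""
    if r == 0:
        return 1 / (2 * n)
    if r == n:
        return 1 - 1 / (2 * n)
    return r / n


def min_logit_chi2(counts, n, xs):
    """Minimum logit chi-square estimates (a, b): weighted regression of logit on dose."""
    s_w = s_wx = s_wxx = s_wl = s_wxl = 0.0
    for r, x in zip(counts, xs):
        p = working_proportion(r, n)
        w = n * p * (1 - p)
        l = math.log(p / (1 - p))
        s_w += w
        s_wx += w * x
        s_wxx += w * x * x
        s_wl += w * l
        s_wxl += w * x * l
    b = (s_w * s_wxl - s_wx * s_wl) / (s_w * s_wxx - s_wx**2)
    a = (s_wl - b * s_wx) / s_w
    return a, b


xs = (-1, 0, 1)
for counts in [(3, 5, 7), (2, 5, 8), (1, 5, 9), (0, 5, 10)]:
    a, b = min_logit_chi2(counts, 10, xs)
    print(counts, f"a = {a:+.4f}, b = {b:.4f}")
```

The slope rises steadily through the four samples and remains finite for the last, where the unmodified logits would be infinite.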

Having adopted the rule of 1/2n provisionally, I performed a con- 
siderable number of experiments simulating situations of practical 
bio-assay, to compare the behavior of the estimates obtained in this 
way with those gotten when other procedures were used. Among other 
procedures tried was one analogous with that used in “probit analysis,” 
where a transform line is fitted to the points, excluding the observa- 
tions of zero and 100 per cent and using as substitute observations, 
ones predicted by the fitted line. For the minimum logit χ² estimate,
on the basis of what evidence I have been able to accumulate, I am 
quite satisfied with the simple 2n rule. Using it, all samples yield 
finite estimates, and judged on the basis of the mean square errors, 
these estimates were better than those obtained using any other rule 
which was tried. 

The arithmetic difficulty of what to do with observations of zero 
or 100 per cent, when one is using the linear transform of a function 
and the value of the transform corresponding to these observations is 
infinite, arose early in the use of the normal deviate or probit for fitting 
the integrated normal curve. The problem was brought to R. A. 
Fisher, who gave the answer that is incorporated in the now famous 
Appendix to a paper by C. Bliss [8], the title of which I have borrowed 
for the heading of this section, because it is essentially the same prob- 
lem as the one with which we are dealing here. The answer of Fisher 
was to use for the transform corresponding to an observed p of 100
per cent the working value l̂ + q̂/z and for an observed p of zero the
working value l̂ − (p̂/z), where p̂ = 1 − q̂ is the estimate of the true response
at the dose, l̂ is the linear transform value corresponding to p̂,
and z can be defined as ∂p̂/∂l̂. For the integrated normal curve, z is the
ordinate of the normal distribution at x; for the logistic function, it is
equal to p̂q̂. This treatment of zero and 100 per cent response, first given
by Fisher for the integrated normal curve, was applied also to the logistic
function [17] and, judging from a recent declaration [11], is considered
by him to be mandatory for these and all similar situations.

Not long after the publication of Fisher’s [8] note, it was made clear
[7] that the substitution of the working value defined for the observations
of zero and 100 per cent was the application to these observations
of a necessary modification of any observation p, in maximum likelihood
estimation using the linear transform, the general formula for the
working value being (p − p̂)/z + l̂. This device certainly meets the difficulty
in many cases, and a finite estimate is obtained, in spite of the
fact that the value of the transform for the observation is infinite. It is
apparently widely believed that it meets the difficulty in all cases, but
I found early in this investigation that it does not.

There are samples which do not yield a finite estimate by maximum
likelihood for one or both of the parameters to be estimated even using
the working value of Fisher. In general, these are samples which, in
an experiment with N doses, have N or (N−1) observations of zero
or 100 per cent.¹⁵ Thompson [29] called attention to the fact that for
some such samples the maximum likelihood method does not converge
to a finite estimate, and emphasized the necessity to make ad hoc provision
for such cases. For instance, in the estimate of α given β as known,
for the 3 dose experiment, samples with all three observations zero per
cent or all three 100 per cent response yield no finite estimate by
maximum likelihood. In the estimate of both parameters simultaneously,
with the first observation zero per cent, the last observation 100
per cent, and the observation at the second dose 50 per cent, maximum
likelihood yields no finite estimate for β, and with any other observation
at the central dose, it yields no finite estimate for either parameter.
In some situations these insoluble samples are extremely rare, in others
they are not rare, while in some situations they are extremely frequent.
For instance, in an experiment with 3 equally spaced doses, the corresponding
true P’s of which are 0.1, 0.5, 0.9, the insoluble samples
constitute more than 12 per cent of the total sampling population; in
an experiment with 4 equally spaced doses the lowest of which corresponds
to true P=0.01 and the highest of which corresponds to P=0.99,
more than 20 per cent of the samples are insoluble by maximum
likelihood. If in the 3 dose experiment the P of the lowest dose is
P=0.01 and for the highest dose it is P=0.99, more than 80 per cent of
the samples are insoluble by maximum likelihood.¹⁶

15 But not necessarily any such sample. The same is true of the minimum Pearson χ² estimate.

I do not see how these estimates can be considered acceptable. The 
situations referred to are not “pathological.” On the contrary, they are 
quite reasonable; some of them correspond to very good experimental 
designs and yield excellent estimates by other available methods of 
estimation. An estimator that yields an infinite estimate in from 10 to 
80 per cent of samples for some well-designed experiments can hardly 
be considered “proper” and it seems to me that one is obliged either to 
modify or to abandon it. 
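
The 12 and 80 per cent figures quoted above can be checked approximately. The following is a minimal sketch, assuming 10 exposed at each of 3 doses (the group size used in footnote 16 and implied by the 1,331 possible samples of Appendix Note 4). It counts only the dominant class of insoluble samples, those of the form 0, p, 1.0, so it gives a lower bound and not the author's exact tally; the function name is illustrative.

```python
def prob_zero_p_one(p_low: float, p_high: float, n: int = 10) -> float:
    """Probability that the lowest dose shows 0 responses out of n and the
    highest dose shows n responses out of n, whatever the middle dose does."""
    return (1.0 - p_low) ** n * p_high ** n


# True P's of 0.1, 0.5, 0.9 (the middle dose does not enter the product).
centered = prob_zero_p_one(0.1, 0.9)

# True P's of 0.01, 0.5, 0.99.
extreme = prob_zero_p_one(0.01, 0.99)

print(f"{centered:.4f}  {extreme:.4f}")
```

The two values come to roughly 12 and 82 per cent, in line with "more than 12 per cent" and "more than 80 per cent"; the remaining insoluble classes (such as 0, 0, p) add only a small amount in these designs.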

I have not had the temerity to modify the maximum likelihood or 
minimum Pearson $\chi^2$ estimate as I have done with the minimum logit
$\chi^2$ estimate. In the text, the comparisons of the mean square error of
the minimum logit $\chi^2$ estimates with the other estimates are made,
omitting, in the calculation of the statistics of all estimates, samples 
which had an infinite estimate by maximum likelihood. For the main 
experiment with dosages centrally placed, these samples are not very 
frequent, but as the disposition of the dosages is moved to positions 
asymmetrical with respect to central P=0.5, insoluble samples be- 
come more frequent and the comparisons referred to in the text do not 
include situations in which samples insoluble by maximum likelihood 
constitute as much as five per cent of the total. 

The question has been raised as to whether the comparisons showing 
results advantageous to the minimum logit $\chi^2$ estimate are not the
consequence solely of the application of the 2n rule to the minimum
logit $\chi^2$ estimates, and not to the maximum likelihood estimates. As
respects this question, it should be noted first that if the 2n rule did
in fact account for the better results where these were obtained,
and this was not peculiar to some special situation, it would be no
adverse reflection on the results, but rather a support of the soundness
of the 2n rule. However, it is not a fact that the application of the
2n rule to the minimum logit $\chi^2$ estimate and not the maximum
likelihood estimate accounts for all the differences found between the
estimates.





16 The behavior of the maximum likelihood estimator in the present situation strikes one as logically
obstreperous. If we think of the principle of maximum likelihood in its general intent, it should estimate
for a parameter $\theta$, a value which, if it were the true value, would yield the sample in hand most
frequently. One would suppose, then, that if a sample S is very frequent with some particular $\theta$, such a
sample would characteristically yield $\theta$ by maximum likelihood, or estimates near $\theta$. Now, with a 3 dose
symmetrical experiment, 10 at each dose, with $\alpha = 0$, $\beta = 4.595$, samples in the class 0, p, 1.0, where p
is any value $0 < p < 1.0$, occur with probability about 80 per cent; yet no such sample yields by
maximum likelihood any value near $\beta = 4.595$, for all yield $\beta = \infty$, and only one sample in this class yields a
finite estimate for $\alpha$. On the other hand, samples in the class 0, 1.0, p occur only with probability of 0.09
per cent, yet with all these samples except 0, 1.0, 1.0, maximum likelihood indicates finite estimates
for $\alpha$ and $\beta$, not excessively far from the true values.
Even if the 2n rule is applied to the observations of zero or 100 per
cent for the maximum likelihood estimate as well as for the minimum
logit $\chi^2$ estimate, the comparisons favorable to the minimum logit
$\chi^2$ estimates remain. In Table 10 is shown a comparison of statistics
of the maximum likelihood and minimum logit $\chi^2$ estimates, with the
2n rule used in the case of observations of zero or 100 per cent
response for the maximum likelihood estimate as well as for the
minimum logit $\chi^2$. It is seen that while the results are better for
the maximum likelihood estimate than when Fisher’s rule is applied
(Table 2), they still are inferior to those obtained with the minimum
logit $\chi^2$ estimate.


TABLE 10

APPLICATION OF 2n RULE TO MAXIMUM LIKELIHOOD ESTIMATE

Mean-square-error

                      Central P=.5                          Central P=.7
           Min. logit  Max. lik.   Max. lik.    Min. logit  Max. lik.   Max. lik.
           $\chi^2$    unmodified  2n rule      $\chi^2$    unmodified  2n rule

Estimate of $\alpha$, $\beta$ known
    a      .130        .156        .151         .158        .193        .171
Estimate of $\alpha$ and $\beta$
    a      .150        .183        .171         .375        .406        .394
    b      .256        .299        .287         .257        .361        .275

Mean square errors with application of the 2n rule for maximum likelihood estimates. Based on 100
stratified random samples. The maximum likelihood estimates are improved, but the relative position
of the estimates is unchanged.


APPENDIX NOTE 4. SUFFICIENCY OF THE ESTIMATES 


A function of the observations $T$ is a sufficient statistic for a
parameter $\theta$ if, with respect to any other statistic $T'$, the
conditional probability of $T'$ given $T$ is independent of $\theta$. If
$\phi$ is the probability of a sample set of observations, and $T$ is a
sufficient statistic, then for all samples with the same value of $T$,
the value of $\partial \ln \phi / \partial \theta$ is the same, and this
being true, if it is also true that for all samples having the same value
of $\partial \ln \phi / \partial \theta$, the value of $T$ is the same,
$T$ is a minimal sufficient statistic. In general, a sufficient statistic
is minimal sufficient, if it is a function of any other sufficient
statistic [19a].

For the logistic function

$$\frac{\partial \ln \phi}{\partial \alpha} = \sum n_i (p_i - P_i), \tag{33}$$

where $n_i$ is the number “exposed” at $x_i$, $p_i$ is the observed relative
frequency at $x_i$, and $P_i$ is the true value of the probability at $x_i$ given
by the logistic function. Since $n_i$ and $P_i$ are constants from sample to
sample, the value of (33) will be the same in samples, if the value of
$\sum n_i p_i$ is the same in the samples. It follows that $\sum n_i p_i$ is a sufficient
statistic for $\alpha$ in the present situation, and it is a minimal sufficient
statistic.

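Similarly,

$$\frac{\partial \ln \phi}{\partial \beta} = \sum n_i x_i (p_i - P_i), \tag{34}$$

and $\sum n_i x_i p_i$ is a minimal sufficient statistic for $\beta$.

A more direct demonstration of the sufficiency of the statistics
$\sum n_i p_i$, $\sum n_i x_i p_i$ can be given as follows:

If $P_i = 1 - Q_i$, $n_i$, $r_i = p_i n_i$, $s_i = n_i - r_i = q_i n_i$, are respectively the
probability of response, the number exposed, the number responding,
and the number not responding, at $x_i$, the probability of a sample is

$$
\begin{aligned}
\phi &= \prod_i C^{n_i}_{r_i} P_i^{r_i} Q_i^{s_i} \\
     &= \prod_i C^{n_i}_{r_i} \frac{\left[e^{-(\alpha + \beta x_i)}\right]^{s_i}}{\left[1 + e^{-(\alpha + \beta x_i)}\right]^{n_i}} \\
     &= \prod_i C^{n_i}_{r_i} \prod_i \left[1 + e^{-(\alpha + \beta x_i)}\right]^{-n_i} e^{-\alpha \sum s_i} e^{-\beta \sum s_i x_i} \\
     &= \prod_i C^{n_i}_{r_i} \prod_i \left[1 + e^{-(\alpha + \beta x_i)}\right]^{-n_i} e^{-\alpha \sum n_i - \beta \sum n_i x_i} e^{\alpha \sum r_i} e^{\beta \sum r_i x_i}.
\end{aligned}
\tag{35}
$$
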

It can be seen directly from (35) that if $\alpha$ is changed by an amount
$\Delta\alpha$, all samples having the same value for the statistic $\sum r_i$ will be
changed in probability by the same factor and therefore that the
frequency distribution of these samples will remain unchanged. Hence
$\sum r_i = \sum n_i p_i$ is a sufficient statistic for $\alpha$. Similarly, if $\beta$ is changed
by an amount $\Delta\beta$, all samples having the same value for the statistic
$\sum r_i x_i$ will be changed in probability by the same factor, and the
frequency distribution of the samples will be unchanged. Hence
$\sum r_i x_i = \sum n_i p_i x_i$ is a sufficient statistic for $\beta$.

It is evident that any one-to-one function of a sufficient statistic is 
sufficient, and also that if an estimator yields a different value of the 
estimate for each possible sample, it is sufficient.¹⁷

Considering joint estimation of $\alpha$ and $\beta$, so far as disclosed by this
investigation, the minimum logit $\chi^2$ estimate is sufficient, for each
possible sample yields a different pair of estimates. For the estimate of
$\alpha$ given $\beta$, at least for all experiments with equally spaced dosages $x$, 10
exposed at each dose, the minimum logit $\chi^2$ estimate is sufficient, for
whenever two or more samples yield the same value for the estimate of
$\alpha$, these samples have the same value for the corresponding minimal
sufficient statistic $\sum n_i p_i$.

The maximum likelihood estimate is infinite for some samples (see
Appendix Note 3), and for these the values of the corresponding
minimal sufficient statistic $\sum n_i p_i$ or $\sum n_i x_i p_i$ are not always the same;¹⁸
therefore the maximum likelihood estimate is not fully sufficient.
However, excluding samples with infinite estimate, there is a one-to-one
relation between the maximum likelihood estimate and the corresponding
minimal sufficient statistic, and therefore so far as samples with
finite estimates are concerned, the maximum likelihood estimate is
minimal sufficient.

The minimum Pearson $\chi^2$ estimator yields infinite estimates for the
same samples for which the maximum likelihood estimate is infinite
and therefore it also fails of full sufficiency. For samples yielding finite
estimates, the minimum Pearson $\chi^2$ is here sufficient, since all samples
yielding the same value of the estimate have the same value of the
corresponding minimal sufficient statistic.

Some of the findings of this investigation appear to conflict with 
generalities which are widely accepted as having been established 
mathematically, for example: 

1. A sufficient estimator when it exists is unique, that is, it is either 
the maximum likelihood estimate or a function of the maximum likeli- 
hood estimate [19]. 





17 As Fisher [15] puts it, “... the intrinsic accuracy of $T$ can never be greater than when every
possible sample yields a different value of $T$ ... in such a case, the actual sample can be reconstructed
without ambiguity from the value of $T$, and so the value of $T$ ... must contain the whole of the
information supplied by the sample.”

18 For instance, all samples 0, 0, p with $0 \le p \le 1$ yield an infinite maximum likelihood estimate for
$\alpha$, and all such samples except the sample with $p = 0$ yield an infinite estimate for $\beta$, but the value of
$\sum n_i p_i$ and of $\sum n_i x_i p_i$ is different for each such sample.
2. Among asymptotically efficient estimators, the maximum likelihood
estimator extracts the most information from the data, and where
there is a sufficient statistic, it extracts 100 per cent [14].

The first of these generalities is refuted by the finding that the
minimum logit $\chi^2$ estimate is sufficient, while it is not a function of the
maximum likelihood estimate. Assemble all 1,331 possible samples in
groups, such that the samples in each group have the same value for the
minimal sufficient statistic of $\alpha$, $\sum n_i p_i$; we will call such a group of
samples a “sufficiency group.” The maximum likelihood estimate of $\alpha$
with $\beta$ known is the same for each sample in a sufficiency group, but
the minimum logit $\chi^2$ estimate is not, and in fact is usually, but not
always, different for each sample. Fixing the maximum likelihood
estimate does not determine the minimum logit $\chi^2$ estimate, and the
minimum logit $\chi^2$ estimate is therefore not a function of the maximum
likelihood estimate. However, no two samples in different sufficiency
groups have the same minimum logit $\chi^2$ estimate, which, since $\sum n_i p_i$
is minimal sufficient, is the necessary and sufficient condition for the
sufficiency of the minimum logit $\chi^2$ estimate.
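
The grouping into sufficiency groups can be illustrated with a short enumeration. This is only a sketch under stated assumptions: three doses coded -1, 0, 1, 10 exposed at each dose, and an arbitrary known slope of 1; none of these codings or the helper name `mle_alpha` come from the paper. It confirms that there are $11^3 = 1{,}331$ samples, and it computes the maximum likelihood estimate of $\alpha$ from the total number responding alone, which is why that estimate is constant within a group.

```python
import math
from collections import defaultdict
from itertools import product

N_PER_DOSE = 10            # 10 exposed at each dose, as in the text
DOSES = (-1.0, 0.0, 1.0)   # assumed coding of three equally spaced doses
BETA = 1.0                 # assumed known slope, for illustration only


def mle_alpha(total_responding, beta=BETA):
    """Solve sum n_i (p_i - P_i(alpha)) = 0 for alpha by bisection.

    Only the total number responding enters the equation. Returns None
    when the root is infinite (no responses, or all responses)."""
    if total_responding in (0, N_PER_DOSE * len(DOSES)):
        return None
    lo, hi = -30.0, 30.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        expected = sum(
            N_PER_DOSE / (1.0 + math.exp(-(mid + beta * x))) for x in DOSES
        )
        if expected < total_responding:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Group every possible sample (r_1, r_2, r_3) by the statistic sum r_i.
groups = defaultdict(list)
for sample in product(range(N_PER_DOSE + 1), repeat=len(DOSES)):
    groups[sum(sample)].append(sample)

n_samples = sum(len(members) for members in groups.values())
estimates = {total: mle_alpha(total) for total in groups}
print(n_samples, len(groups))
```

Under these assumptions there are 31 groups (totals 0 through 30), and the two extreme groups are the ones with an infinite estimate of $\alpha$.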

As regards the second generality, the loss of extractable information
associated with an estimator can be calculated as $\sum \Phi \sigma^2$, where $\sigma^2$ is the
variance of $\partial \ln \phi / \partial \theta$ for a set of samples yielding the same value of the
estimate, $\phi$ being the probability of a sample, $\theta$ the parameter
estimated, and $\Phi$ the probability of the set of samples. For a three-dose
experiment with true P’s respectively at 0.01, 0.5, 0.99, both
parameters to be estimated, 62 per cent of the population of samples have a
maximum likelihood estimate for $\alpha$ which is infinite. Moreover these
samples do not all have the same value for $\partial \ln \phi / \partial \alpha = \sum n_i (p_i - P_i)$,
and the calculated loss of information by the maximum likelihood
estimator is 76.0 per cent. For $\beta$, 82 per cent of the population of
samples yield an estimate of infinity, but most of the samples have the same
value for $\partial \ln \phi / \partial \beta = \sum n_i x_i (p_i - P_i)$, and the calculated loss of
information by the maximum likelihood estimator for $\beta$ is only 0.097 per cent.
For the minimum logit $\chi^2$ estimate, there are no samples with estimate
of infinity, and for all samples yielding the same estimate, the value of
$\partial \ln \phi / \partial \theta$ is the same, so there is no loss of information with the
minimum logit $\chi^2$ estimate in this experiment.

These calculations and the statements based upon them are made 
from my best understanding of how to compute the loss of information, 
but I do not feel entirely secure in the correctness of my procedures. 
Perhaps Sir Ronald Fisher would object to grouping all samples having 
infinity as estimate, as having the same value for the estimate; perhaps 
different $\infty$’s must be considered as different values. But this poses
starkly the conundrum of how these estimates are to be treated, not
only as regards practical situations, but also as a mathematical
question.
SOME STATISTICAL PROBLEMS IN RELATING
EXPERIMENTAL DATA TO PREDICTING 
PERFORMANCE OF A PRODUCTION 
PROCESS 


T. W. ANDERSON 
Columbia University 


The performance of a production process is characterized 
by the quantity and quality of the output; it is affected by 
factors of production, such as conditions of the process and 
quantities of inputs. In experimentation where the factors are 
controlled, we assume a bivariate linear regression model with 
quantity and quality of output as dependent variates. In op- 
eration where quality of output is controlled by adjusting one 
of the factors, another regression model is used in which quan- 
tity of output and the one factor are dependent variates. The 
second model is derived from the first. It is shown how to use 
experimental data to estimate the coefficients of the regression 
of quantity of output in the second model; this regression func- 
tion is desired for predicting performance in operation. Con- 
fidence regions and tests of hypotheses are treated. The exposi- 
tion is in the form of an analysis of a particular problem met 
by a chemical engineering firm. 


1. INTRODUCTION


IN THIS paper we analyze a statistical problem met by a chemical
engineering firm (The M. W. Kellogg Company, Jersey City, N. J.)
in applying results from experimentation to predicting performance in 
production. The feature of this problem that is of special interest is 
that in experimentation the two most relevant characteristics of the 
product, measures of quantity and quality, are permitted to vary but 
various factors are controlled, while in production one of the product 
characteristics, quality, is specified and one of the factors is varied to 
obtain the specified quality. The data from the experiments are used 
to estimate the coefficients of the regression model with two dependent 
variates which is the model appropriate to experimentation. From this 
information we derive estimates of the linear function which is ap- 
propriate to predicting the quantity produced in commercial operation 
when quality is controlled by varying a given factor of production. 
Confidence regions are given, and tests of hypotheses are considered. 

The particular problem analyzed here arose in work done by the 
Petroleum and Chemical Research Laboratory of The M. W. Kellogg 
Company. It occurred in a study of the effect of certain operating 
variables on yields and qualities from a refining process involving the
solvent extractions of a petroleum fraction for the production of a raw 
lubricating oil base stock. The paper is written in terms of this specific 
case, but the procedures used are described in sufficient generality to 
be applied to other problems which involve the same feature of inter- 
changing a dependent and an independent variable. It is expected 
that this feature will arise in many other situations; as a matter of 
fact, in The M. W. Kellogg Company this problem has arisen a number 
of times. The author is indebted to The M. W. Kellogg Company for 
calling this problem to his attention, for supporting a part of his work, 
and for furnishing the data analyzed in this paper. 

It turns out that the models and statistical methodology used in this 
problem are similar to those used in some econometric problems
involving “structural equations,” although the reasons that the models
are appropriate here are different from the reasons in econometrics.


2. THE ENGINEERING PROBLEM


In the petroleum industry there are various processes in which oil 
of some kind is fed into a complicated and expensive apparatus; the 
main product of some processes is oil of a higher quality. In the process 
with which we are concerned two other liquids besides oil are fed into an 
extraction apparatus (called an “extractor”) ; we shall call these “Liquid 
A” and “Liquid B.” Besides the output of desired high quality oil, 
called “raffinate,” there is also drawn from the apparatus an undesired 
product, called “extract.” The flows of liquids A and B relative to the 
flow of feed oil, of course, affect the quantity and quality of the de- 
sired raffinate; the other important characteristic of the process is the 
temperature. Increasing temperature or flow of liquid A increases the 
purity of the product, but decreases the yield; increasing the flow of 
liquid B decreases the purity, but increases the yield. The feed oil, 
and liquids A and B flow into the extractor at constant rates while the 
process goes on; the flow of liquid A is measured in terms of the volume 
relative to the volume of oil; the flow of liquid B is measured as the per 
cent by weight of flow of liquid A. The temperature, measured in 
degrees Fahrenheit, is actually an average of temperatures in various 
parts of the extractor. There are, of course, a number of other charac- 
teristics of the process, but these are relatively unimportant compared 
to the ones mentioned above, and in the experiments these other char- 
acteristics were held as constant as possible. The flow of the desired 
product, namely the raffinate, is measured relative to the flow of the 
feed. The purity or quality of the raffinate is measured in a way too 
complicated to describe here; it is an index with high values desired.

In a given experiment the flows of liquids A and B and the tempera- 
ture of the process are set, and the flow and quality of raffinate are 
measured. The extractor is run for several hours; the data used are 
averages over the period of time for which the process is “stationary” 
or “in control.” For given flows of liquids A and B and for given 
temperature, the yield and quality vary from experiment to experi- 
ment. A part of this variation is due to variation in the characteristics 
of the feed, though there are other causes of the variation. 

When the process is used commercially, it is essential to control the 
quality of the raffinate. This is done by adjusting the temperature of 
the process so that the raffinate flowing out has a specified purity. The 
flows of liquids A and B are not varied, but are predetermined. Thus 
in the experiments flows of A and B and temperature are controlled 
(or “fixed”) variables, and yield and quality are uncontrolled, while in 
production flows of A and B and quality of output are controlled, while 
yield and temperature are not. 

From the experimental data on a given extractor, the engineer 
wishes to describe the performance of the extractor in production. He 
wants to be able to predict the yield of raffinate of specified purity that 
will result from given flows of A and B. 


3. THE STATISTICAL MODELS 


In experimentation the temperature, say $x_1$, the flow of A, say $x_2$,
and the flow of B, say $x_3$, are independent variates, and the yield, say
$y_1$, and the purity index, say $y_2$, are dependent variables. We write

$$y_1 = \alpha_{10} + \alpha_{11}(x_1 - \bar{x}_1) + \alpha_{12}(x_2 - \bar{x}_2) + \alpha_{13}(x_3 - \bar{x}_3) + u_1, \tag{1}$$

$$y_2 = \alpha_{20} + \alpha_{21}(x_1 - \bar{x}_1) + \alpha_{22}(x_2 - \bar{x}_2) + \alpha_{23}(x_3 - \bar{x}_3) + u_2, \tag{2}$$

where $\bar{x}_1$, $\bar{x}_2$, and $\bar{x}_3$ are some convenient values of $x_1$, $x_2$, and $x_3$,
respectively; $u_1$ and $u_2$ are discrepancies and are considered as random
or statistical variables. The engineers and statisticians working on these
problems consider that the linear equations (1) and (2) are adequate;
in particular, that $u_1$ and $u_2$ can be taken to have a bivariate
distribution, approximately normal with zero means and constant variances,
say $\sigma_1^2$ and $\sigma_2^2$, and covariance $\sigma_1 \sigma_2 \rho$ (or correlation $\rho$). Hence, (1) and
(2) are treated as customary regression equations with three
independent variates and two dependent variates.

In production $x_1$ is adjusted to give the desired $y_2$; that is

$$(x_1 - \bar{x}_1) = \frac{1}{\alpha_{21}} y_2 - \frac{\alpha_{20}}{\alpha_{21}} - \frac{\alpha_{22}}{\alpha_{21}}(x_2 - \bar{x}_2) - \frac{\alpha_{23}}{\alpha_{21}}(x_3 - \bar{x}_3) - \frac{1}{\alpha_{21}} u_2. \tag{3}$$

We consider $u_2$ as a discrepancy that holds constant for the entire run.
The temperature is varied at the beginning of the run so that the purity
is at the desired level, and then the temperature is held constant for
the rest of the run. Thus in (3), $y_2$, $x_2$, and $x_3$ are controlled or fixed
variables and $x_1$ is a statistical or dependent variable. When we put
$x_1 - \bar{x}_1$ as given in (3) into (1) we get


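$$y_1 = \beta_0 + \beta y_2 + \beta_2(x_2 - \bar{x}_2) + \beta_3(x_3 - \bar{x}_3) + v, \tag{4}$$

where

$$\beta_0 = \alpha_{10} - \beta \alpha_{20}, \tag{5}$$

$$\beta = \frac{\alpha_{11}}{\alpha_{21}}, \tag{6}$$

$$\beta_2 = \alpha_{12} - \beta \alpha_{22}, \tag{7}$$

$$\beta_3 = \alpha_{13} - \beta \alpha_{23}, \tag{8}$$

$$v = u_1 - \frac{\alpha_{11}}{\alpha_{21}} u_2 = u_1 - \beta u_2. \tag{9}$$

In (4) $y_2$, $x_2$, and $x_3$ are fixed, and $y_1$ is allowed to vary. The discrepancy
$v$ has mean and variance

$$E v = E(u_1 - \beta u_2) = 0, \tag{10}$$

$$E v^2 = E(u_1 - \beta u_2)^2 = \sigma_1^2 - 2\beta \sigma_1 \sigma_2 \rho + \beta^2 \sigma_2^2. \tag{11}$$

To predict the performance of the extraction process in production one
needs to know the coefficients, $\beta_0$, $\beta$, $\beta_2$, and $\beta_3$ and the variance. In
this model, (4) and the variance of $v$ give a complete description of the
process.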

The statistical procedure we shall use involves estimating the
coefficients of (1) and (2) by least squares and estimating $\sigma_1^2$, $\sigma_2^2$ and $\rho$
by the residuals. The estimates of these parameters are substituted
into (5), (6), (7), (8), and (11) to obtain estimates of $\beta_0$, $\beta$, $\beta_2$, $\beta_3$ and
the variance of $v$. We shall also find a confidence interval for $\beta$ and
confidence regions for pairs of coefficients such as $\beta$ and $\beta_3$.
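
The substitution step can be sketched numerically. The snippet below is a minimal illustration, not the author's computation: it takes as inputs the rounded reduced-form estimates printed in Section 5, uses $\beta = \alpha_{11}/\alpha_{21}$ as indicated by (9), and applies (5), (7), (8), and (11) with estimates in place of parameters. The variable names are illustrative.

```python
import math

# Reduced-form estimates as printed in Section 5 (rounded values).
a1 = {"const": 58.529, "x1": -0.3829, "x2": -5.050, "x3": 2.308}  # yield, eq. (1)
a2 = {"const": 98.675, "x1": 0.1558, "x2": 4.144, "x3": -0.700}   # purity, eq. (2)
s1, s2, r = 3.090, 1.619, -0.6632  # residual std. deviations and correlation

b = a1["x1"] / a2["x1"]               # ratio of temperature coefficients
b0 = a1["const"] - b * a2["const"]    # equation (5)
b2 = a1["x2"] - b * a2["x2"]          # equation (7)
b3 = a1["x3"] - b * a2["x3"]          # equation (8)

# Equation (11) with estimates substituted for the parameters.
var_v = s1**2 - 2 * b * s1 * s2 * r + b**2 * s2**2
sd_v = math.sqrt(var_v)

print(round(b, 4), round(b0, 2), round(b2, 3), round(b3, 3), round(var_v, 2))
```

Because the inputs are rounded, the results agree with the values reported in Section 5 ($b = -2.4572$, $b_2 = 5.1338$, $b_3 = .5883$, variance 9.068) only to about three significant figures.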

The statistical methods applied here have been developed in
considerable generality in earlier publications ([1], [2]). However, the
methods needed in the present problem will be developed in this paper.
Since the methods were originally motivated by considerations of
econometric models, so-called “structural equations,” it is useful to
show that the models given above are identical with those used in
econometric models. Suppose $y_1$ and $y_2$ are “endogenous” variables,
that is, economic variables whose formation the model is supposed to
describe. Let $x_1$, $x_2$, and $x_3$ be “exogenous” variables; that is, variables
whose values are determined by some outside mechanism. Then an
econometric model would consist of two structural equations of the
form

$$\phi_{i1} y_1 + \phi_{i2} y_2 = \theta_{i0} + \theta_{i1}(x_1 - \bar{x}_1) + \theta_{i2}(x_2 - \bar{x}_2) + \theta_{i3}(x_3 - \bar{x}_3) + w_i, \qquad i = 1, 2$$

$(\phi_{11}\phi_{22} - \phi_{12}\phi_{21} \neq 0)$. Solution of this pair of equations gives two
equations (1) and (2); this pair is called the “reduced form.” If one of the
structural equations, say the first, has coefficients $\phi_{11} = 1$ and $\theta_{11} = 0$,
then it is algebraically identical with (4). In an econometric model this
structural equation would have some special economic significance. If
the economic system described by the model is changed in a particular
way, $y_2$ might be fixed, and then under this “policy change,” (4) would
describe the formation of $y_1$ with $y_2$ fixed. For an elementary exposition
of these ideas the reader of this journal is referred to [4].

The statistical methods used in this paper are called in the econo- 
metric literature “reduced form” or “limited-information maximum 
likelihood” methods. If the experimental data were obtained in the 
same way that the extractor would be used in production (that is, 
with $y_2$ fixed), then one could use least squares procedures directly on
(4). However, since in the experiments y2 is a dependent variable, the 
least squares procedure applied to (4) would give biased estimates, and 
one cannot use the usual confidence intervals and regions appropriate 
to the usual least squares model. The advantages and disadvantages of 
these alternative methods have been discussed considerably in the 
econometric literature; we shall not go into them here. 

A natural question to ask is why are the experiments conducted 
with temperature fixed instead of conducting them with quality con- 
trolled. One answer is that it seems more “natural” to control tempera- 
ture since controlling quality is a kind of secondary operation; that is, 
in the process it is more natural to consider (1) and (2) as the structural 
equations (as actually reflecting the chemical and physical reaction). 
Secondly, the experiments are conducted to obtain more information
than just what is necessary to predict yield when quality is controlled 
by varying temperature; for instance, one might consider controlling 
quality in production by varying the flow of one of the liquids. More- 
over, there are other engineering characteristics to be studied. 

There is also a relationship between the considerations of this paper 
and a point raised by Berkson [3]. Suppose one considers a pair of
variables, measured with “error” but with the “true” parts satisfying a
linear relationship; that is, the true parts satisfy $\zeta_1 = c + d\zeta_2$, where $\zeta_1$
and $\zeta_2$ are the true variables and the observed variables are $z_1 = \zeta_1 + v_1$
and $z_2 = \zeta_2 + v_2$, where $v_1$ and $v_2$ are errors of observation. Berkson
notes (see also [5]) that if $z_2$ is controlled then
$z_1 = \zeta_1 + v_1 = c + d(z_2 - v_2) + v_1 = c + d z_2 - d v_2 + v_1$,
and since $z_2$ is fixed (or, in other terms, $v_1 - d v_2$
is statistically independent of $z_2$) one can consider the above equation
as a regression equation. This is analogous to (4). Here is another case
where the nature of the model is changed by fixing a variable that was
alternatively not fixed.


4. THE EXPERIMENTAL DATA


The data we have at our disposal were obtained from a well-designed 
experiment or set of experiments, at The M. W. Kellogg Company and 
hence are easy to analyze. 

Two levels of each of the controlled variables were used. In the table 
below we write T=1 and 2 to mean temperatures of 160° and 180°, 
A=1 and 2 to mean flows of liquid A at 2.0 and 3.5 times the flow of 
feed (relative volumes) and B=1 and 2 to mean flows of liquid B at 
2% and 5% of the flow of liquid A (by weight). The first experiment 
consisted of 16 runs. In these 16 runs four different arrangements or 
patterns of the apparatus were used, indicated as P=1, 3, 4, 5. Eight 
additional runs were made using patterns 3 and 4. We are taking
pattern differences to be small enough to be ignored. As a matter of fact,
an analysis of $y_1$, for example, for the first 16 runs indicates the pattern
differences to be small; the F-value is actually less than one.

Another feature of the data that shows some departure from the 
assumed model is that the variation in the later runs is less than that 
in the earlier runs; this is shown by a comparison of the error variances 
of the last 8 runs. This feature may be due in part, at least, to increasing 
skill on the part of the engineers in keeping the process “in control.” 
It would have been preferable, of course, to have had the order of runs 
randomized. 
TABLE


RESULTS OF 24 RUNS OF AN EXTRACTOR OF 
THE M. W. KELLOGG COMPANY 
99.4
99.0 
100.4 
92.6 
103.3 
93.6 
98.2 
93.4 
106.0 
99.4 
98.8 
93.5 
103.5 
96.8 
101.9 
94.1 
101.2 
98.2 
101.5 
93.7 
106.3 
96.1 
100.9 
96.4 
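
As a consistency check on the column of 24 values above, the sketch below computes its mean. The mean coincides with 98.675, the constant term estimated for the purity equation in Section 5 (where the constant is stated to equal the mean of the dependent variate). This suggests, though the surviving text does not say so, that the column is the purity index $y_2$; the variable names are illustrative.

```python
# The 24 values of the surviving table column, in run order.
surviving_column = [
    99.4, 99.0, 100.4, 92.6, 103.3, 93.6, 98.2, 93.4,
    106.0, 99.4, 98.8, 93.5, 103.5, 96.8, 101.9, 94.1,
    101.2, 98.2, 101.5, 93.7, 106.3, 96.1, 100.9, 96.4,
]

n_runs = len(surviving_column)
column_mean = sum(surviving_column) / n_runs
print(n_runs, round(column_mean, 3))
```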
5. ESTIMATION OF COEFFICIENTS, VARIANCES, AND
CORRELATION COEFFICIENT 


Since only two levels of each factor are used, equations (1) and (2)
could be written in the form used in analysis of variance by letting
$\alpha_{i1}(x_1 - \bar{x}_1)$ be $\lambda_{i2}$ for $x_1 = 180°$ and be $\lambda_{i1} = -\lambda_{i2}$ for $x_1 = 160°$, letting
$\alpha_{i2}(x_2 - \bar{x}_2)$ be $\mu_{i2}$ for $x_2 = 3.5$ and be $\mu_{i1} = -\mu_{i2}$ for $x_2 = 2.0$, and letting
$\alpha_{i3}(x_3 - \bar{x}_3)$ be $\nu_{i2}$ for $x_3 = 5.0$ and be $\nu_{i1} = -\nu_{i2}$ for $x_3 = 2.0$. Then

$$y_i = \alpha_{i0} + \lambda_{ij} + \mu_{ik} + \nu_{il} + u_i, \qquad i = 1, 2, \tag{12}$$

for levels $j$ of $x_1$, $k$ of $x_2$, and $l$ of $x_3$. We shall, however, hold to the
original notation because it is more general and hence more suitable for
other problems of this sort.

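The estimates of the coefficients in (1) and (2) are obtained by the
usual methods of univariate regression as

$$a_{ij} = \frac{1}{c_j} \sum_{\gamma=1}^{N} y_{i\gamma}(x_{j\gamma} - \bar{x}_j), \qquad j = 1, 2, 3, \tag{13}$$

where $\gamma$ $(= 1, \ldots, N)$ denotes the observation number, $N = 24$,
$\bar{x}_j = \sum_{\gamma} x_{j\gamma}/N$, and

$$c_j = \sum_{\gamma=1}^{N} (x_{j\gamma} - \bar{x}_j)^2. \tag{14}$$

Because of the orthogonality properties of the design
$\sum_{\gamma} (x_{j\gamma} - \bar{x}_j)(x_{k\gamma} - \bar{x}_k) = 0$ $(j \neq k)$.
Finally $a_{i0} = \bar{y}_i = \sum_{\gamma} y_{i\gamma}/N$. The variance of $u_i$ is estimated
from

$$
\begin{aligned}
(N - 4)s_i^2 &= \sum_{\gamma=1}^{N} \left[ y_{i\gamma} - a_{i0} - a_{i1}(x_{1\gamma} - \bar{x}_1) - a_{i2}(x_{2\gamma} - \bar{x}_2) - a_{i3}(x_{3\gamma} - \bar{x}_3) \right]^2 \\
&= \sum_{\gamma=1}^{N} y_{i\gamma}^2 - N a_{i0}^2 - c_1 a_{i1}^2 - c_2 a_{i2}^2 - c_3 a_{i3}^2.
\end{aligned}
$$

The correlation between $u_1$ and $u_2$ is estimated from

$$
\begin{aligned}
(N - 4) r s_1 s_2 &= \sum_{\gamma=1}^{N} \left[ y_{1\gamma} - a_{10} - \sum_{j=1}^{3} a_{1j}(x_{j\gamma} - \bar{x}_j) \right] \left[ y_{2\gamma} - a_{20} - \sum_{j=1}^{3} a_{2j}(x_{j\gamma} - \bar{x}_j) \right] \\
&= \sum_{\gamma=1}^{N} y_{1\gamma} y_{2\gamma} - N a_{10} a_{20} - \sum_{j=1}^{3} c_j a_{1j} a_{2j}.
\end{aligned}
$$

The estimates are

    $a_{10} = 58.529$,     $a_{20} = 98.675$,
    $a_{11} = -.3829$,     $a_{21} = .1558$,
    $a_{12} = -5.050$,     $a_{22} = 4.144$,
    $a_{13} = 2.308$,      $a_{23} = -.700$,
    $s_1 = 3.090$,         $s_2 = 1.619$,
    $r = -.6632$,

and $\bar{x}_1 = 170$, $\bar{x}_2 = 2.75$, and $\bar{x}_3 = 3.5$. The equations for predicting yield
and purity in experiments are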
$$Y_1 = 58.529 - .3829(x_1 - 170) - 5.050(x_2 - 2.75) + 2.308(x_3 - 3.5), \tag{18}$$

$$Y_2 = 98.675 + .1558(x_1 - 170) + 4.144(x_2 - 2.75) - .700(x_3 - 3.5). \tag{19}$$

The estimates of the coefficients of (4) are $b_0 = 301.058$, $b = -2.4572$,
$b_2 = 5.1338$ and $b_3 = .5883$ and the estimate of the variance of $v$ is
$s_1^2 - 2 b s_1 s_2 r + b^2 s_2^2 = 9.068$ (and standard deviation 3.011). The equation
for predicting yield at a given purity is

$$
\begin{aligned}
Y_1 &= 301.058 - 2.4572\,Y_2 + 5.1338(x_2 - 2.75) + .5883(x_3 - 3.5) \\
    &= 58.592 - 2.4572(Y_2 - 98.675) + 5.1338(x_2 - 2.75) + .5883(x_3 - 3.5).
\end{aligned}
\tag{20}
$$


6. CONFIDENCE INTERVALS AND TESTS OF HYPOTHESES


We base confidence intervals and tests of hypotheses on the theory
that the pair of estimates $(a_{1j}, a_{2j})$ have a bivariate normal distribution
with means $(\alpha_{1j}, \alpha_{2j})$ and variances $(\sigma_1^2/c_j, \sigma_2^2/c_j)$ and correlation
$\rho$ (i.e., covariance $\sigma_1 \sigma_2 \rho / c_j$). Because of the orthogonality of the design
(i.e., $\sum (x_{1\gamma} - \bar{x}_1)(x_{2\gamma} - \bar{x}_2) = 0$), one pair of estimates is independent
of each other pair. If we want a confidence interval for a given
coefficient $\alpha_{ij}$, we use the usual Student $t$ procedure; the confidence
interval with confidence coefficient $1 - \epsilon$ consists of all $\alpha_{ij}^*$ satisfying

$$\frac{\sqrt{c_j}\, | a_{ij} - \alpha_{ij}^* |}{s_i} < t_{N-4}(\epsilon), \tag{21}$$

where $t_{N-4}(\epsilon)$ is the two-tailed significance point of the $t$-distribution
with $N - 4$ degrees of freedom at significance level $\epsilon$. If we want a
confidence region for a pair $(\alpha_{1j}, \alpha_{2j})$ we use the $T^2$ procedure; the
confidence region with confidence coefficient $1 - \epsilon$ consists of all pairs
$\alpha_{1j}^*$, $\alpha_{2j}^*$ satisfying

$$\frac{c_j}{1 - r^2} \left[ \frac{(a_{1j} - \alpha_{1j}^*)^2}{s_1^2} - 2r \frac{(a_{1j} - \alpha_{1j}^*)(a_{2j} - \alpha_{2j}^*)}{s_1 s_2} + \frac{(a_{2j} - \alpha_{2j}^*)^2}{s_2^2} \right] \le \frac{(N - 4)\,2}{N - 5} F_{2,N-5}(\epsilon), \tag{22}$$

where $F_{2,N-5}(\epsilon)$ is the upper significance point of the $F$-distribution
with 2 and $N - 5$ degrees of freedom at significance level $\epsilon$. For example,
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a confidence region with confidence coefficient .95 for the pair of co- 
efficients of temperature is 


$$
\frac{[\alpha_{11}^* - (-.3829)]^2}{.01652} + \frac{2[\alpha_{11}^* - (-.3829)][\alpha_{21}^* - .1558]}{.01305} + \frac{[\alpha_{21}^* - .1558]^2}{.004533} \le 1. \tag{23}
$$


To test the hypothesis that $\alpha_{11} = 0$ and $\alpha_{21} = 0$, we observe that $\alpha_{11}^* = 0$ and $\alpha_{21}^* = 0$ does not satisfy the above inequality, and hence reject this null hypothesis at the .05 significance level.

Now let us consider the coefficient $\beta$ in (4). Any given linear combination $a_{11} - \beta^* a_{21}$ is normally distributed with mean $\alpha_{11} - \beta^*\alpha_{21}$ and variance $[\sigma_1^2 - 2\beta^*\rho\sigma_1\sigma_2 + (\beta^*)^2\sigma_2^2]/c_1$. A test of the hypothesis that $\beta = \beta^*$ is equivalent to a test that $a_{11} - \beta^* a_{21}$ has mean zero. Under this null hypothesis $a_{11} - \beta^* a_{21}$ is normally distributed with mean zero and independently of $s_1^2 - 2\beta^* s_1 s_2 r + (\beta^*)^2 s_2^2$. Application of the usual least squares theory to $y_{1\gamma} - \beta^* y_{2\gamma}$ shows that $(N-4)\,(s_1^2 - 2\beta^* s_1 s_2 r + (\beta^*)^2 s_2^2)/(\sigma_1^2 - 2\beta^*\rho\sigma_1\sigma_2 + (\beta^*)^2\sigma_2^2)$ has the $\chi^2$-distribution with $N - 4$ degrees of freedom. Thus under the hypothesis $\sqrt{c_1}\,(a_{11} - \beta^* a_{21})/\sqrt{s_1^2 - 2\beta^* s_1 s_2 r + (\beta^*)^2 s_2^2}$ has the $t$-distribution with $N - 4$ degrees of freedom. The hypothesis is rejected if








$$
\frac{c_1 (a_{11} - \beta^* a_{21})^2}{s_1^2 - 2\beta^* s_1 s_2 r + (\beta^*)^2 s_2^2} > t_{N-4}^2(\epsilon). \tag{24}
$$
A confidence interval for $\beta$ consists of all $\beta^*$ for which the corresponding null hypothesis is not rejected, that is, all $\beta^*$ not satisfying the above inequality. This interval has the end points

$$
\frac{c_1 a_{11} a_{21} - t_{N-4}^2(\epsilon)\, s_1 s_2 r}{c_1 a_{21}^2 - t_{N-4}^2(\epsilon)\, s_2^2}
\pm \frac{\sqrt{c_1 t_{N-4}^2(\epsilon)\,[a_{11}^2 s_2^2 - 2 a_{11} a_{21} r s_1 s_2 + a_{21}^2 s_1^2] - t_{N-4}^4(\epsilon)\, s_1^2 s_2^2 (1 - r^2)}}{c_1 a_{21}^2 - t_{N-4}^2(\epsilon)\, s_2^2}. \tag{25}
$$

In our case the end points of a confidence interval with confidence .95 are $-2.747 \pm .962$.
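The end points in (25) are the two roots of the quadratic obtained by replacing the inequality (24) with an equality. The sketch below evaluates them numerically. Two inputs are assumptions rather than figures quoted in this section: $c_1 = 2400$ (which would correspond to 24 runs with temperature set 10 units either side of its mean) and $t = 2.086$, the standard two-tailed 5 per cent point of $t$ with 20 degrees of freedom. The function name is hypothetical.

```python
import math


def beta_interval(a11, a21, s1, s2, r, c1, t):
    """End points of the interval (25): the roots in beta of
    c1*(a11 - beta*a21)^2 = t^2*(s1^2 - 2*beta*s1*s2*r + beta^2*s2^2)."""
    t2 = t * t
    denom = c1 * a21 ** 2 - t2 * s2 ** 2
    if denom <= 0:
        # The accepted set is then not a finite interval.
        raise ValueError("c1*a21^2 must exceed t^2*s2^2")
    centre = (c1 * a11 * a21 - t2 * s1 * s2 * r) / denom
    disc = (c1 * t2 * (a11 ** 2 * s2 ** 2 - 2 * a11 * a21 * r * s1 * s2
                       + a21 ** 2 * s1 ** 2)
            - t2 ** 2 * s1 ** 2 * s2 ** 2 * (1 - r ** 2))
    half = math.sqrt(disc) / denom
    return centre - half, centre + half


# c1 = 2400 and t = 2.086 are assumed inputs (see the note above).
lo, hi = beta_interval(-0.3829, 0.1558, 3.090, 1.619, -0.6632,
                       c1=2400.0, t=2.086)
centre, half_width = (lo + hi) / 2, (hi - lo) / 2
print(centre, half_width)
```

Under these assumptions the centre and half-width come out close to the quoted $-2.747$ and $.962$.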

We are also interested in confidence intervals for the other coefficients in (4). Unfortunately, no method is known for obtaining an exact confidence interval for one of the other coefficients. However, we can obtain a joint confidence region for $\beta$ and one other coefficient, say $\beta_3$. For any specified $\beta^*$, $a_{13} - \beta^* a_{23}$ is normally distributed with mean $\alpha_{13} - \beta^*\alpha_{23}$ and variance $[\sigma_1^2 - 2\beta^*\sigma_1\sigma_2\rho + (\beta^*)^2\sigma_2^2]/c_3$ and independently of $a_{11} - \beta^* a_{21}$ (because the pair $(a_{13}, a_{23})$ is independent of the pair $(a_{11}, a_{21})$). A test of the hypothesis that $\beta = \beta^*$ and $\beta_3 = \beta_3^*$ is equivalent to a test that $a_{11} - \beta^* a_{21}$ has mean zero and that $a_{13} - \beta^* a_{23}$ has mean $\alpha_{13} - \beta^*\alpha_{23} = \beta_3^*$. This hypothesis is rejected if


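$$
\frac{c_1 (a_{11} - \beta^* a_{21})^2 + c_3 (a_{13} - \beta^* a_{23} - \beta_3^*)^2}{s_1^2 - 2\beta^* s_1 s_2 r + (\beta^*)^2 s_2^2} \ge 2F_{2,20}(\epsilon), \tag{26}
$$

where $F_{2,20}(\epsilon)$ is the $\epsilon$ significance point of the $F$-distribution with 2 and 20 degrees of freedom. A confidence region for the pair $(\beta, \beta_3)$ consists of all pairs $(\beta^*, \beta_3^*)$ which do not satisfy the above inequality. The confidence region can be written as

$$
\begin{aligned}
&(c_1 a_{21}^2 + c_3 a_{23}^2 - 2F s_2^2)(\beta^* - m)^2 + 2 c_3 a_{23} (\beta^* - m)\,[\beta_3^* - (a_{13} - a_{23} m)] + c_3 [\beta_3^* - (a_{13} - a_{23} m)]^2\\
&\qquad < \frac{2F c_1 (a_{11}^2 s_2^2 - 2 a_{11} a_{21} s_1 s_2 r + a_{21}^2 s_1^2) - 4F^2 s_1^2 s_2^2 (1 - r^2)}{c_1 a_{21}^2 - 2F s_2^2},
\end{aligned} \tag{27}
$$

where $F = F_{2,20}(\epsilon)$ and

$$
m = \frac{c_1 a_{11} a_{21} - 2F s_1 s_2 r}{c_1 a_{21}^2 - 2F s_2^2}. \tag{28}
$$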


In our case, for $1 - \epsilon = .95$, we have the region

$$
.8839\,[\beta^* - (-3.002)]^2 - 1.0056\,[\beta^* - (-3.002)]\,[\beta_3^* - .207] + .7183\,[\beta_3^* - .207]^2 < 1.
$$

The center of the ellipse is at $\beta^* = -3.002$, $\beta_3^* = .207$. The maximum value of $\beta^*$ in the ellipse is $-1.641$ and the minimum, $-4.363$; the maximum value of $\beta_3^*$ is 1.728 and the minimum $-1.314$. It is seen that there are pairs $(\beta^*, \beta_3^*)$ with $\beta_3^* = 0$ included in the ellipse; thus there are null hypotheses with $\beta_3 = 0$ which would be accepted by this test procedure. That is, the procedure would lead to acceptance of some null hypotheses specifying $\beta$ and specifying that $x_3$ does not really enter (4).
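The extreme values of $\beta^*$ and $\beta_3^*$ on such an ellipse follow from the inverse of the matrix of the quadratic form. As a rough check, the sketch below applies that standard result to the rounded coefficients printed above; the function name is hypothetical. With those rounded inputs it reproduces the limits for $\beta_3^*$ and gives limits for $\beta^*$ within about .01 of the quoted ones.

```python
import math


def ellipse_extents(p, q, w, u0, v0):
    """Extreme u and v on the ellipse
    p*(u-u0)^2 + q*(u-u0)*(v-v0) + w*(v-v0)^2 = 1."""
    det = p * w - q * q / 4.0
    if p <= 0 or det <= 0:
        raise ValueError("quadratic form is not positive definite")
    du = math.sqrt(w / det)   # half-range in u
    dv = math.sqrt(p / det)   # half-range in v
    return (u0 - du, u0 + du), (v0 - dv, v0 + dv)


(beta_min, beta_max), (beta3_min, beta3_max) = ellipse_extents(
    0.8839, -1.0056, 0.7183, -3.002, 0.207)
print(beta_min, beta_max, beta3_min, beta3_max)
```

Since the interval for $\beta_3^*$ straddles zero, the sketch also confirms that pairs with $\beta_3^* = 0$ lie inside the region.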

The hypothesis that $\beta_3 = 0$ is equivalent to the hypothesis that $\alpha_{13}\alpha_{21} - \alpha_{11}\alpha_{23} = 0$, that is, that the matrix

$$
\begin{pmatrix} \alpha_{11} & \alpha_{13} \\ \alpha_{21} & \alpha_{23} \end{pmatrix} \tag{30}
$$

is of rank one (rank zero being regarded as impossible here). Tests of a hypothesis of rank usually involve the roots of a certain determinantal equation, in this case

$$
\begin{vmatrix}
c_1 a_{11}^2 + c_3 a_{13}^2 - \lambda (N-4) s_1^2 & c_1 a_{11} a_{21} + c_3 a_{13} a_{23} - \lambda (N-4) s_1 s_2 r \\
c_1 a_{11} a_{21} + c_3 a_{13} a_{23} - \lambda (N-4) s_1 s_2 r & c_1 a_{21}^2 + c_3 a_{23}^2 - \lambda (N-4) s_2^2
\end{vmatrix} = 0. \tag{31}
$$

The roots are 3.352 and .058. The likelihood ratio criterion is $(1 + .058)^{-N/2}$. The hypothesis is to be rejected (on the basis of asymptotic theory) if $N \log_e (1 + .058)$ exceeds the significance point of the $\chi^2$-distribution with one degree of freedom [1]. In this case the number $24 \log_e (1 + .058) = 1.344$, which is less than the 5% or 10% significance point. Alternative criteria to be compared with this significance point (such as $N \times .058$ or $(N - 4) \times .058$) are also small.
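Because (31) is a 2 by 2 determinant, its roots are those of an ordinary quadratic in $\lambda$. The sketch below expands the determinant and solves it. The values $c_1 = 2400$ and $c_3 = 54$ are assumptions, not figures quoted in this section (they would correspond to 24 runs with $x_1$ set 10 units and $x_3$ set 1.5 units either side of their means), and $N - 4 = 20$; the function name is hypothetical.

```python
import math


def rank_test_roots(a1, a2, c, s1, s2, r, dof):
    """Roots of det(M - lam*dof*S) = 0, with M_hi = sum_j c_j*a_hj*a_ij
    and S the estimated covariance matrix of the two residuals."""
    m11 = sum(cj * x * x for cj, x in zip(c, a1))
    m12 = sum(cj * x * y for cj, x, y in zip(c, a1, a2))
    m22 = sum(cj * y * y for cj, y in zip(c, a2))
    e11, e12, e22 = dof * s1 * s1, dof * s1 * s2 * r, dof * s2 * s2
    # Expand the determinant into qa*lam^2 + qb*lam + qc = 0.
    qa = e11 * e22 - e12 * e12
    qb = -(m11 * e22 + m22 * e11 - 2 * m12 * e12)
    qc = m11 * m22 - m12 * m12
    root = math.sqrt(qb * qb - 4 * qa * qc)
    return (-qb + root) / (2 * qa), (-qb - root) / (2 * qa)


# (a11, a13) and (a21, a23) from the text; c1, c3 are assumed values.
large, small = rank_test_roots(a1=(-0.3829, 2.308), a2=(0.1558, -0.700),
                               c=(2400.0, 54.0), s1=3.090, s2=1.619,
                               r=-0.6632, dof=20)
statistic = 24 * math.log(1 + small)   # N * log_e(1 + smaller root)
print(large, small, statistic)
```

With these assumed inputs the roots agree with the quoted 3.352 and .058 to the precision shown, and the statistic is well below 3.84, the 5 per cent point of $\chi^2$ with one degree of freedom.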

In the present problem, however, the fact that we cannot reject the null hypothesis of $\beta_3 = 0$ does not seem to be sufficient reason for eliminating $x_3$ from (4). The confidence region for $\beta$, $\beta_3$ admits values of $\beta_3$ as large as 1.728. If $\beta_3$ is actually this large it has engineering significance; that is, then the use of liquid B in the process does increase the yield of raffinate at the desired purity level. It also has economic significance if the cost of using a certain flow of liquid B is less than the profit due to increased flow of raffinate. Since the statistical procedure does not rule out the economic significance, we do not eliminate $x_3$ from (4). It should be noted that if the confidence region included only very small values of $\beta_3$ (so small that admitted values could have no engineering or economic significance), we would be led to eliminate $x_3$ from (4).

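A confidence region for all of the coefficients in (4) is given by

$$
\frac{N (a_{10} - \beta^* a_{20} - \beta_0^*)^2 + c_1 (a_{11} - \beta^* a_{21})^2 + c_2 (a_{12} - \beta^* a_{22} - \beta_2^*)^2 + c_3 (a_{13} - \beta^* a_{23} - \beta_3^*)^2}{s_1^2 - 2\beta^* s_1 s_2 r + (\beta^*)^2 s_2^2} \le 4F_{4,20}(\epsilon). \tag{32}
$$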





If the above expression is expanded, the confidence region can be expressed as

$$
\sum_{i,j} d_{ij} (\beta_i^* - e_i)(\beta_j^* - e_j) \le g, \tag{33}
$$

where the sum runs over $i, j = 0, 2, 3$ and blank (i.e., $\beta_i^* = \beta^*$). If $c_1 a_{21}^2 / s_2^2 > 4F_{4,20}(\epsilon)$, then $(d_{ij})$ is a positive definite matrix and (33) is an ellipsoid in the space of $\beta_0^*$, $\beta_2^*$, $\beta_3^*$, and $\beta^*$. The generalized Schwarz inequality states

$$
\Big[\sum_{i,j} h_i d_{ij} (\beta_j^* - e_j)\Big]^2 = \Big[\sum_{i,j} h_i d_{ij} \beta_j^* - \sum_{i,j} h_i d_{ij} e_j\Big]^2 \le \Big[\sum_{i,j} d_{ij} (\beta_i^* - e_i)(\beta_j^* - e_j)\Big]\Big[\sum_{i,j} d_{ij} h_i h_j\Big] \tag{34}
$$

for any numbers $h_i$. Thus

$$
-\sqrt{g \sum_{i,j} d_{ij} h_i h_j} + \sum_{i,j} h_i d_{ij} e_j \;\le\; \sum_{i,j} h_i d_{ij} \beta_j^* \;\le\; \sqrt{g \sum_{i,j} d_{ij} h_i h_j} + \sum_{i,j} h_i d_{ij} e_j \tag{35}
$$

for all $h_i$. This fact can be used to obtain confidence regions on $E y_1$ for all possible choices of $y_2$, $x_2$, and $x_3$ in the production process, for

$$
E y_1 = \beta_0 + \beta y_2^* + \beta_2 (x_2^* - \bar{x}_2) + \beta_3 (x_3^* - \bar{x}_3), \tag{36}
$$

when $y_2 = y_2^*$, $x_2 = x_2^*$, and $x_3 = x_3^*$. (36) is of the form $\sum h_i d_{ij} \beta_j$ for appropriate choices of $h_i$. The details of this procedure are too involved to include in this paper. However, the procedure is important because it allows the investigator to make confidence statements about the expected yield in the production process for any choices of quality and flows of liquids A and B.


APPENDIX. MORE GENERAL FORMULAS 


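In the experiment analyzed above the factors were applied at two levels and the design was balanced. In this appendix we give some formulas for the more general case. Suppose that the two regression equations suitable for the experimental data are

$$
y_i = \alpha_{i0} + \sum_{j=1}^{p} \alpha_{ij} (x_j - \bar{x}_j) + u_i, \qquad i = 1, 2. \tag{37}
$$

We assume $N$ runs are made. Let

$$
c_{jk} = \sum_{\gamma=1}^{N} (x_{j\gamma} - \bar{x}_j)(x_{k\gamma} - \bar{x}_k), \tag{38}
$$

where $\bar{x}_j = \sum_\gamma x_{j\gamma}/N$. The estimate of $\alpha_{i0}$ is $a_{i0} = \bar{y}_i = \sum_\gamma y_{i\gamma}/N$ and the estimates of $\alpha_{i1}, \ldots, \alpha_{ip}$ are the solutions of

$$
\sum_{k=1}^{p} c_{jk} a_{ik} = \sum_{\gamma=1}^{N} y_{i\gamma} (x_{j\gamma} - \bar{x}_j), \qquad j = 1, \ldots, p, \tag{39}
$$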

for $i = 1, 2$. The set of $2p$ estimates have in theory a joint normal distribution; the expected values of the estimates are the parameters and

$$
E(a_{ij} - \alpha_{ij})(a_{ik} - \alpha_{ik}) = c^{jk} \sigma_i^2, \tag{40}
$$

$$
E(a_{1j} - \alpha_{1j})(a_{2k} - \alpha_{2k}) = c^{jk} \sigma_1 \sigma_2 \rho, \tag{41}
$$

where the matrix $(c^{jk})$ is the inverse of $(c_{jk})$. The equation for predicting $y_1$ when $x_1$ is adjusted to make $y_2$ a given value is

$$
y_1 = \beta_0 + \beta y_2 + \sum_{j=2}^{p} \beta_j (x_j - \bar{x}_j) + v, \tag{42}
$$

where $\beta_0$, $\beta$, and $v$ are given by (5), (6), and (9) and $\beta_j = \alpha_{1j} - \beta\alpha_{2j}$, $j = 2, \ldots, p$, and the variance of $v$ is given by (11). The estimates $b_0, b, b_2, \ldots, b_p$ are found from the $a_{ij}$ as $\beta_0, \beta, \beta_2, \ldots, \beta_p$ are found from the $\alpha_{ij}$. The estimates of $\sigma_1^2$ and $\sigma_2^2$ are found from


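$$
\begin{aligned}
(N - p - 1)\, s_i^2 &= \sum_{\gamma=1}^{N} \Big[ y_{i\gamma} - a_{i0} - \sum_{j=1}^{p} a_{ij} (x_{j\gamma} - \bar{x}_j) \Big]^2\\
&= \sum_{\gamma=1}^{N} y_{i\gamma}^2 - N a_{i0}^2 - \sum_{j,k=1}^{p} c_{jk} a_{ij} a_{ik},
\end{aligned} \tag{43}
$$

and $r$ is found from

$$
\begin{aligned}
(N - p - 1)\, r s_1 s_2 &= \sum_{\gamma=1}^{N} \Big[ y_{1\gamma} - a_{10} - \sum_{j=1}^{p} a_{1j} (x_{j\gamma} - \bar{x}_j) \Big] \Big[ y_{2\gamma} - a_{20} - \sum_{k=1}^{p} a_{2k} (x_{k\gamma} - \bar{x}_k) \Big]\\
&= \sum_{\gamma=1}^{N} y_{1\gamma} y_{2\gamma} - N a_{10} a_{20} - \sum_{j,k=1}^{p} c_{jk} a_{1j} a_{2k}.
\end{aligned} \tag{44}
$$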


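The confidence interval for $\alpha_{ij}$ is

$$
\frac{\lvert a_{ij} - \alpha_{ij}^* \rvert}{s_i \sqrt{c^{jj}}} \le t_{N-p-1}(\epsilon). \tag{45}
$$

The confidence region for $(\alpha_{1j}, \alpha_{2j})$ is given by (22) with $c_j$ replaced by $1/c^{jj}$, with $(N-4)/(N-5)$ replaced by $(N-p-1)/(N-p-2)$ and with $F_{2,N-5}(\epsilon)$ replaced by $F_{2,N-p-2}(\epsilon)$. To test the hypothesis $\beta = \beta^*$ or to find a confidence region for $\beta$ we use (24) and (25), respectively, with $c_1$ replaced by $1/c^{11}$ and $t_{N-4}(\epsilon)$ replaced by $t_{N-p-1}(\epsilon)$. To test a hypothesis $\beta = \beta^*$, $\beta_j = \beta_j^*$ we use the rejection region

$$
\frac{c^{jj} (a_{11} - \beta^* a_{21})^2 - 2 c^{1j} (a_{11} - \beta^* a_{21})(a_{1j} - \beta^* a_{2j} - \beta_j^*) + c^{11} (a_{1j} - \beta^* a_{2j} - \beta_j^*)^2}{(c^{11} c^{jj} - (c^{1j})^2)\,(s_1^2 - 2\beta^* s_1 s_2 r + (\beta^*)^2 s_2^2)} \ge 2F_{2,N-p-1}(\epsilon). \tag{46}
$$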
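To test that $\alpha_{11}\alpha_{2j} - \alpha_{21}\alpha_{1j} = 0$, we use the roots of

$$
\begin{vmatrix}
D_{11} - \lambda (N - p - 1) s_1^2 & D_{12} - \lambda (N - p - 1) s_1 s_2 r \\
D_{21} - \lambda (N - p - 1) s_1 s_2 r & D_{22} - \lambda (N - p - 1) s_2^2
\end{vmatrix} = 0, \tag{47}
$$

where

$$
D_{hi} = \frac{a_{h1} a_{i1} c^{jj} - a_{h1} a_{ij} c^{1j} - a_{hj} a_{i1} c^{1j} + a_{hj} a_{ij} c^{11}}{c^{11} c^{jj} - (c^{1j})^2}. \tag{48}
$$

The derivations of these procedures are found in [1].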
SOME MINIMUM COST EXPERIMENTAL PROCEDURES
IN QUADRATIC REGRESSION*


A. DE LA GARZA, D. S. HAWXHURST, AND L. T. NEWMAN
Carbide and Carbon Chemicals Company, Oak Ridge, Tennessee


Minimum cost procedures for multi-stage experiments determine the amount of replication at the various stages so that a desired estimate is obtained with a specified precision at minimum cost. This paper extends these procedures to experiments in which the characteristic $\vartheta$ being estimated is related to a controlled variable $x$ by $\vartheta = \alpha + \beta x + \gamma x^2$. Minimum cost procedures are obtained for estimating $\vartheta$ with at least a specified precision over the permitted experimental range of $x$. The observations of $\vartheta$ are assumed to be uncorrelated and to have known equal variances and costs independent of $x$. Application is made to problems in chemical experimentation.


1. INTRODUCTION 


THE problem of obtaining estimates with a specified precision from
experiments at minimum cost is becoming increasingly important
in industrial research and development. In many of these experiments, 
costs and errors arise in well-defined stages. For example, in estimating 
a chemical concentration of a batch, the batch is sampled and the 
samples are analyzed. Both of these stages are sources of cost and error. 
Provided the costs and standard errors involved are known, minimum 
cost procedures for such experiments are easily obtained; see, for exam- 
ple, reference [3]. These procedures determine the amount of replication 
required at the various stages so that the desired estimate may be ob- 
tained with the specified precision at minimum cost. 

This paper considers minimum cost procedures for experiments in which the characteristic $\vartheta$ being estimated is related to a controlled variable $x$ by the quadratic relation $\vartheta = \alpha + \beta x + \gamma x^2$. Estimates of the characteristic $\vartheta$ with at least a specified precision are desired over the permissible experimental range of $x$ at minimum cost. It is seen that in these experiments, besides the usual question of the amount of replication, there is also the problem of determining values of the controlled variable $x$ at which observations of $\vartheta$ should be made. A solution to this problem is given for uncorrelated observations of $\vartheta$ with known constant variance and costs independent of $x$.





* Work done under AEC Contract No. W-7405-eng-26. 
2. OPTIMUM LOCATION


The minimum cost problems considered are illustrated by the following example:

Data are collected in a pilot plant to find the relation of the expected chemical concentration $\vartheta$ of pilot plant product with an operating pressure $x$. An observation of $\vartheta$ is the measured chemical concentration $y$ of the product batch of a pilot plant run. Due to equipment restrictions, the pilot plant must be operated in the pressure range $x_L$ to $x_H$. After the data have been taken, a smooth curve is fitted to the observed points and an estimate of $\vartheta$ for any pressure in the range $x_L$ to $x_H$ may be obtained. It is desired that the precision of the estimates of $\vartheta$ shall not exceed a specified upper limit in the range $x_L$ to $x_H$.

In this experiment, the problems are to determine the required number $N_r$ of observations $y_i$, $i = 1(1)N_r$, and for what values of $x$ they should be obtained so that the specified precision demand is satisfied at minimum cost. To solve these problems for a reasonable number of situations, the following assumptions are made:

(1) the cost of an observation $y_i$ is independent of $x$;

(2) the observations $y_i$ are uncorrelated and have constant variance $V$;

(3) the $x$ values do not have error;

(4) the expectation of an observation $y_i$ taken at $x$ is $E(y_i) = \vartheta$;

(5) the $\vartheta$ versus $x$ relation is adequately represented by the quadratic

$$\vartheta = \alpha + \beta x + \gamma x^2;$$

(6) the estimates of $\vartheta$ are given by

$$Y = a + bx + cx^2,$$

where $a$, $b$, and $c$ are the least squares estimates of $\alpha$, $\beta$, and $\gamma$.

Under the above assumptions (2) to (5), it is shown in reference [1] that if $N$ observations $y_i$, $i = 1(1)N$, are taken, the observations being limited to the range $x_L$ to $x_H$, the spacing of these observations which minimizes the maximum variance of the $Y$ values in the range $x_L$ to $x_H$ is $N/3$ observations at each of $x_L$, $(x_L + x_H)/2$, and $x_H$. The maximum variance of the $Y$ values is then $3V/N$. Hence, if $V_r$ is the variance corresponding to the specified precision limit on the $Y$ values, the required number $N_r$ of observations is determined from

$$V_r = 3V/N_r. \tag{1}$$

For observations with cost independent of location $x$, the above number and spacing of observations constitutes a minimum cost procedure, inasmuch as the permitted maximum variance $V_r$ of the $Y$ values cannot be attained with fewer than $N_r$ observations regardless of their spacing in the interval $(x_L, x_H)$.

It may be found rather disturbing that observations are concen- 
trated so heavily. In practice it is doubtful whether it is desirable to 
have observations at only three x’s. It may be that the data are better 
fitted by a polynomial of higher degree than a quadratic; but as long as 
observations are located at three x’s, this could never be detected. 
Fortunately, the spacing may be changed within reasonable limits from 
the minimum cost spacing without seriously affecting the precision of 
the Y values. This is best shown by an example. 

Assume that a minimum cost spacing requires nine observations; accordingly, three observations are located at each of $x_L$, $(x_L + x_H)/2$, and $x_H$. As an alternative, consider spacing the nine observations as follows: two observations at $x_L$, two observations at $x_H$, the remaining five observations being located at $x_L + k(x_H - x_L)/6$, $k = 1(1)5$. These spacings are shown below.

[Figure: MINIMUM COST SPACING, with observations at $x_L$, $\tfrac{1}{2}(x_L + x_H)$ and $x_H$; ALTERNATIVE SPACING, EQUAL INTERVALS, with observations from $x_L$ to $x_H$.]

If in fact $\vartheta = \alpha + \beta x + \gamma x^2$, the standard error of the $Y$ values varies with $x$ for both spacings as shown in Table I.














TABLE I

F(x) for N = 9

  P      Minimum cost spacing    Alternative spacing
 0.0            0.577                   0.569
 0.1            0.573                   0.564
 0.2            0.561                   0.550
 0.3            0.541                   0.528
 0.4            0.516                   0.502
 0.5            0.490                   0.476
 0.6            0.467                   0.458
 0.7            0.457                   0.459
 0.8            0.467                   0.490
 0.9            0.506                   0.555
 1.0            0.577                   0.656

P is the deviation, positive or negative, of $x$ from the mid-point measured in standard units; thus:
$P = \lvert (x - \tfrac{1}{2}(x_H + x_L)) / \tfrac{1}{2}(x_H - x_L) \rvert$.
F(x) is the relative standard error of the $Y$ value at $x$; thus:
$F(x) = (\text{standard error of } Y \text{ value at } x)/\sqrt{V}$.

Table I clearly indicates that $F(x)$ for the alternative spacing does not depart greatly from $F(x)$ for the minimum cost spacing in case the $\vartheta$ versus $x$ relation is $\vartheta = \alpha + \beta x + \gamma x^2$. If the $\vartheta$ versus $x$ relation is better fitted by a higher degree polynomial, the alternative spacing may supply evidence of this, while the minimum cost spacing cannot. It should be pointed out, however, that the spacing of two observations at each end point, the others being distributed at equal intervals between the end points, in general is not to be recommended. For nine observations, such a spacing compares favorably with the corresponding minimum cost spacing as shown in Table I, but the comparison becomes progressively less favorable when the number of observations is increased.
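The entries of Table I can be recomputed from ordinary least squares theory: for a quadratic fit the variance of the fitted value at $x$ is $V\,\mathbf{x}'(X'X)^{-1}\mathbf{x}$ with $\mathbf{x} = (1, x, x^2)'$. The sketch below does this for both spacings, working in the standardized scale in which $x_L = -1$ and $x_H = +1$ (so that the coordinate equals $\pm P$). The function names are hypothetical, and a small hand-written linear solver is used only to keep the example free of external libraries.

```python
import math


def solve3(m, v):
    """Solve the 3x3 system m z = v by Gauss-Jordan elimination."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda i: abs(a[i][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for i in range(3):
            if i != col:
                f = a[i][col] / a[col][col]
                a[i] = [x - f * y for x, y in zip(a[i], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def rel_std_error(design, p):
    """F = sqrt(x'(X'X)^{-1} x) for a quadratic fitted to `design`."""
    xtx = [[sum(x ** (i + j) for x in design) for j in range(3)]
           for i in range(3)]
    x = [1.0, p, p * p]
    z = solve3(xtx, x)
    return math.sqrt(sum(xi * zi for xi, zi in zip(x, z)))


# Standardized designs for N = 9 (end points at -1 and +1).
min_cost = [-1.0] * 3 + [0.0] * 3 + [1.0] * 3
alternative = [-1.0, -1.0, 1.0, 1.0] + [k / 3.0 - 1.0 for k in range(1, 6)]

for p in (0.0, 0.5, 0.7, 1.0):
    print(p, round(rel_std_error(min_cost, p), 3),
          round(rel_std_error(alternative, p), 3))
```

The same function can be applied to other candidate spacings to see how much the maximum of $F(x)$ rises when observations are spread over more than three locations.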

In conclusion, a procedure which should be satisfactory in most 
situations is to space the observations at more than three locations so 
that there will be little increase in the maximum standard error of the 
Y values as compared to the maximum standard error given by the 
minimum cost spacing. With this procedure, protection is provided 
against failure of the quadratic hypothesis, and if the hypothesis is 
true, little is lost in the precision of the estimates of $\vartheta$.


3. OPTIMUM LOCATION AND AMOUNT OF REPLICATION 


The experiments considered here are an extension of those in Part 2. These problems concern both the location of observations and the replication of experimental stages. For illustration, the pilot plant setup of Part 2 will be followed.

Suppose that in the pilot plant experiment the taking of an observation $y_i$ consists of adjusting operating conditions to the pressure $x$, waiting for equilibrium, taking samples of the product batch for the run, and having these samples analyzed by the laboratory for the desired chemical concentration. Thus, the variance of an observation $y_i$ may be due to small fluctuations in pilot plant operation, some sampling error, and laboratory analytical error. Suppose further that $n_1$ laboratory analyses per sample are made and $n_2$ samples per product batch are taken. The variance $V$ of an observation then is

$$V = V_3 + V_2/n_2 + V_1/n_2 n_1, \tag{2}$$

where $V_1$ is the variance of a laboratory analysis, $V_2$ is the variance of a sample, and $V_3$ is the variance due to pilot plant fluctuations during a run producing a product batch. It is assumed that $V_1$, $V_2$, and $V_3$ are known. On the cost side, suppose that the costs are of the form $f_i + u_i n$, where $f_i$ is a fixed cost and $u_i$ is an incremental unit cost. For example, the cost of $n$ laboratory analyses would be $f_1 + u_1 n$, and similarly for samples, with a subscript 2, and for runs, with a subscript 3. As before, it is demanded of the experiment that the precision of the $Y$ values shall at no place exceed a specified limit in the range $x_L$ to $x_H$. The cost problem is to determine the number of product batches, the number of samples per product batch, and the number of analyses per sample so that the specified precision demand is satisfied with minimum cost.

The solution to this problem is not at all difficult. Let $V_r$ be the variance corresponding to the specified precision limit of the $Y$ values. Since the variance of the observations $y_i$ is constant, it follows from (1) and (2) that

$$V_r = 3(V_3 + V_2/n_2 + V_1/n_2 n_1)/N_r, \tag{3}$$

where $N_r$ is the required number of observations $y_i$, $i = 1(1)N_r$. Also, as before, the spacing of these $N_r$ observations in the interval $x_L$ to $x_H$ is $N_r/3$ observations at each of $x_L$, $(x_L + x_H)/2$, and $x_H$. Furthermore, the total cost of the experiment is

$$C_T = f_3 + f_2 + f_1 + u_3 N_r + u_2 N_r n_2 + u_1 N_r n_2 n_1. \tag{4}$$

The cost problem then is to find $n_1$, $n_2$, and $N_r$ that minimize $C_T$ subject to satisfying the expression for $V_r$. Using the technique of Lagrange multipliers, the answer is readily found to be





$$
\begin{aligned}
n_1 &= (V_1 u_2 / V_2 u_1)^{1/2}; \qquad n_2 = (V_2 u_3 / V_3 u_2)^{1/2};\\
N_r &= 3 V_3^{1/2} \big[ (V_3 u_3)^{1/2} + (V_2 u_2)^{1/2} + (V_1 u_1)^{1/2} \big] / V_r u_3^{1/2}.
\end{aligned} \tag{5}
$$

The minimum total cost is found to be

$$\min C_T = f_3 + f_2 + f_1 + 3 \big[ (V_3 u_3)^{1/2} + (V_2 u_2)^{1/2} + (V_1 u_1)^{1/2} \big]^2 / V_r. \tag{6}$$

It is clear that here, as in Part 2, the $N_r$ observations may be spaced at more than three $x$'s at the expense of a small increase in the maximum standard error of the $Y$ values.
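As an illustration of (5) and (6), the sketch below evaluates the optimum replication for one set of variances and unit costs and confirms that the resulting plan satisfies (3) and that its cost from (4) equals (6). All numerical inputs are hypothetical and chosen only so that the answers are round numbers; the function names are likewise hypothetical.

```python
import math


def minimum_cost_plan(V, u, f, V_r):
    """Optimum (n1, n2, N_r) from (5) and the minimum cost from (6).
    V = (V1, V2, V3), u = (u1, u2, u3), f = (f1, f2, f3)."""
    V1, V2, V3 = V
    u1, u2, u3 = u
    n1 = math.sqrt(V1 * u2 / (V2 * u1))
    n2 = math.sqrt(V2 * u3 / (V3 * u2))
    root_sum = math.sqrt(V3 * u3) + math.sqrt(V2 * u2) + math.sqrt(V1 * u1)
    N_r = 3 * math.sqrt(V3) * root_sum / (V_r * math.sqrt(u3))
    min_cost = sum(f) + 3 * root_sum ** 2 / V_r
    return n1, n2, N_r, min_cost


def total_cost(n1, n2, N_r, u, f):
    """Total cost (4) of a plan."""
    u1, u2, u3 = u
    return sum(f) + u3 * N_r + u2 * N_r * n2 + u1 * N_r * n2 * n1


def max_variance(n1, n2, N_r, V):
    """Maximum variance (3) of the Y values under the three-point spacing."""
    V1, V2, V3 = V
    return 3 * (V3 + V2 / n2 + V1 / (n2 * n1)) / N_r


# Hypothetical inputs, for illustration only.
V = (4.0, 1.0, 0.25)      # V1 analysis, V2 sampling, V3 run-to-run
u = (1.0, 4.0, 16.0)      # unit costs u1, u2, u3
f = (10.0, 20.0, 30.0)    # fixed costs f1, f2, f3
n1, n2, N_r, c_min = minimum_cost_plan(V, u, f, V_r=0.1)
print(n1, n2, N_r, c_min)
```

The solution is in general not in whole numbers (here $N_r = 22.5$), so in practice it must be rounded, preferably after examining the cost in the neighborhood of the minimum.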


4. DISCUSSION


Several comments applicable to these minimum cost problems are in order. Possibly the point most needing discussion is the assumption made throughout that the variances of the experimental stages are known. Unfortunately, it is not easy to determine variances. With independent normal observations, if $s^2$ is an estimate of $V$ with $\nu$ degrees of freedom, $s^2/V$ is distributed like $\chi^2/\nu$. A check with Hald's tables [2] shows, for example, that for $\nu = 30$, the probability that $0.56 < s^2/V < 1.57$ is 95%. It is thus apparent that the minimum cost procedures here given are strictly applicable only when well-established estimates of population variances are available. At other times, only a rough preferred region is outlined. However, it is to be noted that even if not all the variances required are well-established, partial use of a minimum cost procedure is still possible. For example, it is seen in Part 3 that the required number of analyses per sample depends solely on the costs and standard errors of analyses and samples; this number of analyses may be determined even though the variance of a pilot plant run is not known.

As in many optimization studies, once having determined the re- 
quirements at a minimum cost point, it is well to map the behavior of 
the cost in the neighborhood of the minimum. It may well happen that 
at a slightly greater cost, the requirements may be met with greater 
convenience not easily expressed in dollars and cents. For example, a 
minimum cost point may require a number of analyses that overloads 
the laboratory capacity on that particular type of analysis and results 
in long time delays. In such a case, the cost may be investigated around 
the minimum in a direction of reducing the number of analyses. An- 
other reason for such investigation is that the variables considered are 
whole numbers. When the numbers are small and the associated costs 
are high, rounding off an optimization to whole numbers should be done 
carefully. 
It is of interest to note that expressions for minimum total cost procedures, such as (5) and (6), clearly point out how each factor in the experiment contributes to the total cost. Thus, it is possible to isolate the major factors contributing to the total cost, and efforts to reduce the costs and/or associated standard errors still further can be concentrated on the controlling factors.

In practice, costs are likely to have a linear form such as the $f + un$ used in Part 3. Even if the costs are given by some more complicated form, approximation by $f + un$ in the range of interest should be satisfactory. Variations of how such linear costs apply can be made easily. For example, in Part 2, let the cost of the $N_r n_2 n_1$ analyses also depend on the $N_r n_2$ samples, so that the cost of the analyses becomes $f_1 N_r n_2 + u_1 N_r n_2 n_1$. Such a cost for analyses could result from a preliminary laboratory treatment of each sample prior to running the analytical determinations. The total cost for the experiment would be

$$f_3 + f_2 + u_3 N_r + u_2 N_r n_2 + f_1 N_r n_2 + u_1 N_r n_2 n_1,$$

or

$$f_3 + f_2 + u_3 N_r + (u_2 + f_1) N_r n_2 + u_1 N_r n_2 n_1.$$

A comparison of this with the total cost given by (4) shows readily how the cost minimization proceeds.

It is to be emphasized that the minimum cost procedures given here 
assume that the errors in measurements are not correlated. To say the 
least, this implies that the errors in measurements do not depend on 
observer, analytical instrument, etc. For example, some observers and 
some analytical instruments tend to have a “memory.” In such situa- 
tions, replication of measurements to decrease error is often completely 
useless. 

Finally, it may be seen that there are many situations for which the 
results of Part 2 may be coupled with expressions for costs and stand- 
ard errors of experimental stages to obtain minimum cost procedures. 
TABLES OF PERCENTAGE POINTS FOR THE STUDENTIZED
MAXIMUM ABSOLUTE DEVIATE IN
NORMAL SAMPLES*


MAX HALPERIN, SAMUEL W. GREENHOUSE, JEROME CORNFIELD,
AND JULIA ZALOKAR


National Institutes of Health


Tables of upper and lower limits on the upper 5 per cent and 1 per cent points of the distribution of the studentized maximum absolute deviate in normal samples are presented. The method of computation and the reliability of the tables are described, and approximations which may be used to supplement the tables are derived and discussed. Examples of the use of the tables are given with special attention devoted to their use for multiple significance testing on a set of means.


1. INTRODUCTION


LET $x_1, x_2, \ldots, x_k$ be independent normally distributed variates, each with mean $\mu$ and variance $\sigma^2$. Define the studentized maximum absolute deviate by

$$d = \max_{i = 1, 2, \ldots, k} \frac{\lvert x_i - \bar{x} \rvert}{s},$$

where $m s^2/\sigma^2$ is distributed as $\chi^2$ with $m$ degrees of freedom (d.f.) and independent of $x_1, x_2, \ldots, x_k$.

In this paper we present tables of upper and lower limits on the upper 5 per cent and 1 per cent points of the distribution of $d$ for varying d.f. The statistic $d$ is the two-sided version of the studentized extreme deviate of Nair [7]. Thus if we denote by $y_1, y_2, \ldots, y_k$ the ordered values of $x_1, x_2, \ldots, x_k$,

$$d = \max\left[\frac{y_k - \bar{x}}{s}, \frac{\bar{x} - y_1}{s}\right],$$

where the statistics in the square brackets are the two possible extreme deviates in a sample of size $k$ so that each has the distribution tabled by Nair. Both $d$ and the Nair statistic are to be contrasted with the extreme deviate statistic of Thompson [10] which differs from that of Nair in that the scaling statistic, $s$, is the sample standard deviation and is thus not independent of the numerator.





* The authors wish to acknowledge the valuable computing assistance of Mrs. Ina R. Hughes. 
The tables given here may be used as the basis for a test of whether the largest observation, without regard to sign, is too large (i.e., as an outlier test, as may the related Nair and Thompson statistics). The tables may also be used as a basis for multiple significance tests of a set of $k$ means arising from independent samples on normal populations with the same variances but possibly different means. The use of the tables as an outlier test is illustrated by the first example of Section 2. In our second example, use of the tables in multiple testing of means is illustrated when each mean arises from the same size sample. Our third example illustrates a discussion of the modification of computations and use of the tables for multiple testing of means when sample sizes are unequal. In Section 3, results necessary for computation of the tables are derived, and in Section 4 some procedures are given for obtaining approximate significance points.


2. DESCRIPTION AND USE OF TABLES 


Upper and lower limits to the upper 5 per cent and 1 per cent points of the distribution of $d$ are given in Tables 1 and 2 for varying numbers of variates ($k$) and degrees of freedom ($m$). The correct critical value lies between the tabulated upper and lower limits. From the derivation given in Section 3 it is clear that the lower limit is a closer approximation to the correct value than the upper, although the difference between the two will not often be of practical importance.¹

The selection of $m$ as $\ge k$ arose from a consideration of the problem of multiple significance tests on a set of sample means, rather than the detection of outliers. From this point of view, one would expect at least 1 d.f. for error from each sample. In any event, approximate significance points can be obtained for $m < k$ by the procedures of Section 4.

Example 1. Our first example is taken from Snedecor’s Statistical 
Methods [9] and is the same as example 1 of Nair [7]. 

A randomized block experiment with four strains of wheat and five 
replications gives the following mean yield in pounds per plot. 


A B C D 
34.4 34.8 33.7 28.4 


The error variance is estimated as 2.19 with 12 d.f. Following Nair we compute $(\bar{x} - x_1)/s$, where $s$ is the estimated standard error of any one of the means, as $(32.8 - 28.4)/\sqrt{2.19/5}$ or 6.7, with $k = 4$ and $m = 12$. Reference to Table 2 shows this deviation to be significant at less than the .01 level, since 6.7 far exceeds the upper limit for $m = 10$, which is, of course, greater than that for $m = 12$.

¹ Thus, the difference between 6.97 and 6.08, the upper and lower limits for $k = m = 3$, $p = .01$, corresponds to a difference in probability of approximately .003.


TABLE 1
UPPER AND LOWER LIMITS FOR 5% POINT OF DISTRIBUTION OF d

TABLE 2
UPPER AND LOWER LIMITS FOR 1% POINT OF DISTRIBUTION OF d
(k = sample size, m = d.f.)
Example 2. Our second example, from Duncan [2], illustrates the kind of multiple statements that can be made concerning a set of means using the statistic $d$. Given the 11 means, 2.85, 3.60, 3.94, 4.05, 4.55, 4.96, 5.03, 5.41, 5.96, 6.09, 6.97, each with standard error 0.34 (estimated with 30 degrees of freedom), we compute the ratio of each of the 11 deviations from the over-all average to the standard error of the mean as $-5.91$, $-3.70$, $-2.70$, $-2.37$, $-0.90$, $+0.31$, $+0.51$, $+1.63$, $+3.25$, $+3.64$, $+6.23$. Table 1 shows that for $k = 10$ and 15, $m = 30$, the .05 critical value is no greater than 3.08 and no less than 2.83. The largest mean, 6.97, is consequently significantly above the mean of all 11 and may be considered an upward outlier. Similarly, the lowest mean, 2.85, is significantly below the mean of all 11 and may be considered a downward outlier. Furthermore, the two means, 5.96 and 6.09, are also significant upward outliers and the mean 3.60 is a significant downward outlier. We thus assert that each of the two lowest means is below the over-all average and each of the three highest is above the over-all average. The probability that this statement is incorrect when the null hypothesis is true is obviously .05. When the null hypothesis is false this probability is less than .05 (Appendix).
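The computation in Example 2 is easy to reproduce. The sketch below forms the studentized deviations and flags those exceeding the tabulated upper limit 3.08 quoted above; the variable names are hypothetical. The ratios computed from the rounded means may differ from the printed ones in the second decimal place.

```python
means = [2.85, 3.60, 3.94, 4.05, 4.55, 4.96, 5.03, 5.41, 5.96, 6.09, 6.97]
std_error = 0.34       # standard error of each mean (30 d.f.)
upper_limit = 3.08     # quoted upper limit on the .05 critical value

overall = sum(means) / len(means)
ratios = [(m - overall) / std_error for m in means]

# A mean is declared an outlier when its absolute ratio exceeds the limit.
flagged = [m for m, t in zip(means, ratios) if abs(t) > upper_limit]

for m, t in zip(means, ratios):
    print(f"{m:5.2f}  {t:+6.2f}")
print("flagged:", flagged)
```

Using the upper limit as the cut-off is the conservative choice, since the true critical value is no larger than it.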


Remark on unequal numbers of cases in comparison of means 


In the frequently occurring case in which the $k$ means are based upon unequal numbers of cases the present tables may still be used, but with three modifications. (a) The deviation is taken from the weighted over-all mean.² (b) The $i$th deviation is divided, not by the standard error of the $i$th mean, but by this standard error times

$$\sqrt{\frac{(n - n_i)\,k}{n\,(k - 1)}}, \qquad \text{where } n = \sum_{i=1}^{k} n_i.$$

(c) The tabulated upper limit is still an upper limit to the absolute maximum deviate so computed, but the lower limit may not be, as is obvious from the derivations in Section 3. In fact, the lower limit as we have computed it would vary with each configuration of $n_i$ and cannot be practicably tabulated.

Example 3 (from Snedecor [9]). We reproduce below the average
dressing percentage (less 70 per cent) for five breeds of swine. The
variance of a single observation, 5.51, is estimated with 83 degrees of
freedom.





² For problems in which the deviation from the unweighted mean is more suitable this may be used
but the factor multiplying the standard error in step (b) becomes

$$\sqrt{\frac{k(k-2) + n_i \sum_{j=1}^{k} 1/n_j}{k(k-1)}} .$$
Columns: Dressing % · Number of observations (n_i) · studentized deviate √n_i |x̄_i − x̄| / (divisor)

Number of observations (n_i), in the order printed: 18, 45, 6, 9, 15; total 93.

















Breed 1 gives a significantly higher than average dressing per cent at 
the .01 level. 
3. DERIVATION OF RESULTS AND COMPUTATION OF TABLES 


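We consider in detail the special case of σ known. The case for σ
estimated will be seen to be completely analogous. Thus let

$$z_i = \frac{x_i - \bar{x}}{\sigma}, \qquad i = 1, 2, \cdots, k,$$

where x̄ is the mean of the k variates x_i, each having standard deviation σ.
We then have,

$$\operatorname{Var} z_i = \frac{k-1}{k}, \qquad \operatorname{Cov}(z_i, z_j) = -\frac{1}{k}, \quad i \neq j.$$

Thus, z_1, ..., z_k have a k-variate joint normal distribution of rank
k−1.
Now, for any significance level α, we define the positive number, h, by

$$\Pr\{-h \le z_i \le h;\ i = 1, 2, \cdots, k\} = 1 - \alpha. \tag{3.1}$$

It is obvious that (3.1) is equivalent to

$$\Pr\{\max_i |z_i| \le h\} = 1 - \alpha. \tag{3.2}$$

Hence a solution of (3.1) for h, or an approximation thereto, will
yield a procedure for obtaining percentage points for δ = lim_{m→∞} d.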

It is easy to obtain approximate solutions to (3.1) by taking advantage
of the fact that the correlation between each pair of the z's is
−1/(k−1), coupled with elementary probability considerations.
Specifically define A_i as the event |z_i| > h, i = 1, 2, ..., k, and Ā_i as the
event |z_i| ≤ h. We then have immediately for (3.1) by the inequalities
of Bonferroni [4],

$$1 - \sum_{i=1}^{k} \Pr\{A_i\} \;\le\; \Pr\{\bar{A}_1 \bar{A}_2 \cdots \bar{A}_k\} \;\le\; 1 - \sum_{i=1}^{k} \Pr\{A_i\} + \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \Pr\{A_i A_j\}. \tag{3.3}$$

Then, since each z_i has a normal distribution with zero mean and
variance (k−1)/k, and each pair, z_i, z_j, has a bivariate normal distribution
with zero means, common variance (k−1)/k, and covariance
−1/k, we can write


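$$\Pr\{A_i\} = \frac{2}{\sqrt{2\pi}} \int_{\sqrt{k/(k-1)}\,h}^{\infty} \exp\left(-\frac{x^2}{2}\right) dx, \qquad i = 1, 2, \cdots, k, \tag{3.4a}$$

and

$$\Pr\{A_i A_j\} = 2 \int_{\sqrt{k/(k-1)}\,h}^{\infty} \int_{\sqrt{k/(k-1)}\,h}^{\infty} p(x_1, x_2)\, dx_1\, dx_2 + 2 \int_{-\infty}^{-\sqrt{k/(k-1)}\,h} \int_{\sqrt{k/(k-1)}\,h}^{\infty} p(x_1, x_2)\, dx_1\, dx_2, \tag{3.4b}$$

i ≠ j; i, j = 1, 2, ..., k, where p(x_1, x_2) is a bivariate normal density
with the variates involved having zero means, unit variances, and
correlation −1/(k−1).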

Thus, (3.3) is the basis of our computation procedure. By equating
the left hand side of (3.3) to (1−α) and solving for h, we clearly get an
upper limit for h, h_U say. Similarly equating the right hand side of (3.3)
to (1−α) and solving for h, we get a lower limit for h, h_L say.
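For σ known the upper limit has a closed form, since equating the left hand side of (3.3) to 1−α gives Pr{A_i} = α/k, and (3.4a) can be inverted with the normal quantile function. The following is a minimal sketch under that reading; the function names are illustrative and only the Python standard library is assumed.

```python
import math
from statistics import NormalDist

_N = NormalDist()


def h_upper_sigma_known(alpha, k):
    """Upper limit h_U for sigma known: solve 1 - k * Pr{A_i} = 1 - alpha,
    with Pr{A_i} = 2 * (1 - Phi(sqrt(k/(k-1)) * h)) as in (3.4a)."""
    z = _N.inv_cdf(1 - alpha / (2 * k))
    return math.sqrt((k - 1) / k) * z


def left_side_33(h, k):
    """Left hand side of (3.3) for sigma known."""
    c = math.sqrt(k / (k - 1)) * h
    return 1 - k * 2 * (1 - _N.cdf(c))


h_u = h_upper_sigma_known(0.05, 10)
print(round(h_u, 2), round(left_side_33(h_u, 10), 6))
```

For σ estimated the same inversion would use the Student t quantile with m degrees of freedom in place of the normal quantile.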

To go from the case of σ known to σ estimated by, say, s, where
ms²/σ² is a chi-square variate with m d.f. and independent of z_1, z_2,
..., z_k, we need only note that

$$\Pr\left\{\frac{|x_1 - \bar{x}|}{s} \le h, \cdots, \frac{|x_k - \bar{x}|}{s} \le h\right\} = \Pr\left\{\frac{|x_1 - \bar{x}|}{\sigma} \le \frac{s}{\sigma} h, \cdots, \frac{|x_k - \bar{x}|}{\sigma} \le \frac{s}{\sigma} h\right\}.$$

Thus the argument for the case of σ known holds with the single
exception that h is everywhere replaced by (s/σ)h so that we must
integrate over the distribution of s. The result is that analogous to
(3.4) we get





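$$\Pr\{A_i\} = 2 \int_{\sqrt{k/(k-1)}\,h}^{\infty} p(t)\, dt, \qquad i = 1, 2, \cdots, k, \tag{3.5a}$$

and

$$\Pr\{A_i A_j\} = 2 \int_{\sqrt{k/(k-1)}\,h}^{\infty} \int_{\sqrt{k/(k-1)}\,h}^{\infty} p(t_1, t_2)\, dt_1\, dt_2 + 2 \int_{-\infty}^{-\sqrt{k/(k-1)}\,h} \int_{\sqrt{k/(k-1)}\,h}^{\infty} p(t_1, t_2)\, dt_1\, dt_2, \tag{3.5b}$$

i ≠ j; i, j = 1, 2, ..., k, where p(t) is the usual Student's t density with
m d.f. and

$$p(t_1, t_2) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \left[1 + \frac{t_1^2 - 2\rho t_1 t_2 + t_2^2}{m(1 - \rho^2)}\right]^{-(m/2) - 1} \tag{3.6}$$

with ρ = −1/(k−1). The density given by (3.6) is also derived in [3].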

For the case of σ known, h_U was computed using the Bureau of
Standards tables [8] of the normal integral; for σ estimated, h_U was
computed using the t table of Hartley and Pearson [5] and the Pearson
Tables of the Incomplete Beta Function [1] along with Lagrangian
3 point interpolation where necessary. These solutions, of course, are
practically immediate.

To obtain lower limits for h, trial values of h were guessed for each 
k and m and the value of the right hand side of (3.3) was computed. 
Except for integrals of the bivariate density (3.6) this involved only 
computations similar to those for finding the upper limit for h. For the 
integral involving (3.6) the transformation 





, 1 /rsin 0 —— 


“Wasi ee 
_ 1 /rsin@ r cos 0 
_ ali vi =) 
enables us to reduce it to 


- arctan (i—p)+p h2 e 2 —m/2 
(3.8) + oe v= p)(+) (1 4 sec *) 40, 


w J _4x/2+aretan V(1—p)(1+) m 


(3.7) 











which was evaluated by quadrature using Simpson's rule with 8 intervals.³
Successive trials were made using trial values of h_L to two
decimal places until the value (1 − α) was bracketed by two trial values
differing by .01. Linear interpolation was then employed to decide
which of these two values would be tabled. Thus the lower limits given
in Tables 1 and 2 are correct to two decimal places, with a few exceptions.
The exceptions are due to limitations of the interpolation procedure
in the tables [1]. Numerical checks indicated that for cases
where [1] was used, lower limits are too low, but, even then, by at most
.01. The upper limits (h_U) in both tables are correct to two decimal
places.
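The trial-and-interpolation search just described can be imitated for the σ known case. This is a sketch only: instead of the polar transformation of the text it evaluates the bivariate normal probabilities in (3.4b) by a one-dimensional Simpson integral (conditioning on the first variate), it requires k ≥ 3, and the function names and the number of quadrature intervals are assumptions.

```python
import math
from statistics import NormalDist

_N = NormalDist()


def simpson(f, a, b, n=400):
    """Composite Simpson's rule with an even number n of intervals."""
    step = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * step)
    return total * step / 3


def upper_orthant(c, rho):
    """P(X1 > c, X2 > c) for a standard bivariate normal with correlation rho."""
    s = math.sqrt(1 - rho * rho)
    return simpson(lambda x: _N.pdf(x) * _N.cdf((rho * x - c) / s), c, c + 10)


def right_side_33(h, k):
    """Right hand side of (3.3) for sigma known, using (3.4a) and (3.4b)."""
    c = math.sqrt(k / (k - 1)) * h
    rho = -1 / (k - 1)
    p_single = 2 * (1 - _N.cdf(c))
    # {X1 > c, X2 < -c} has the probability of an upper orthant with correlation -rho.
    p_pair = 2 * upper_orthant(c, rho) + 2 * upper_orthant(c, -rho)
    return 1 - k * p_single + k * (k - 1) / 2 * p_pair


def h_upper(alpha, k):
    return math.sqrt((k - 1) / k) * _N.inv_cdf(1 - alpha / (2 * k))


def h_lower(alpha, k, step=0.01):
    """Step down from h_U in trial values .01 apart until 1 - alpha is
    bracketed, then interpolate linearly between the two trial values."""
    target = 1 - alpha
    hi = h_upper(alpha, k)
    for _ in range(1000):
        lo = hi - step
        f_lo, f_hi = right_side_33(lo, k), right_side_33(hi, k)
        if f_lo < target <= f_hi:
            return lo + step * (target - f_lo) / (f_hi - f_lo)
        hi = lo
    raise ValueError("1 - alpha was not bracketed")


h_u, h_l = h_upper(0.05, 5), h_lower(0.05, 5)
print(round(h_l, 3), round(h_u, 3))
```

The two limits printed are close together, which is what makes the tabulated pair of bounds useful in practice.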

The difficulties arising in obtaining significance points if we wish to
use a maximum absolute deviate test in comparison of means based on
unequal numbers of observations may be seen by considering the case
when σ² is known and equal to one, say.

We suppose x̄_1, x̄_2, ..., x̄_k to be means of independent samples of
size n_1, ..., n_k (Σ n_i = n) on a normal population with zero mean
and unit variance. If we define

$$w_i = \frac{n_i}{n}, \qquad \bar{x} = \sum_{i=1}^{k} w_i \bar{x}_i, \qquad z_i = \frac{(\bar{x}_i - \bar{x})\sqrt{n_i}}{\sqrt{\dfrac{n - n_i}{n}\cdot\dfrac{k}{k-1}}}, \qquad i = 1, 2, \cdots, k,$$

easy computation shows that Var z_i = (k−1)/k, i = 1, 2, ..., k. However
we find that

$$\operatorname{cov}(z_i, z_j) = -\frac{k-1}{k} \sqrt{\frac{n_i n_j}{(n - n_i)(n - n_j)}}, \qquad i \neq j.$$

Thus though (3.3) still holds for this case, it becomes impractical, as
earlier remarked, to compute lower limits for h, since for each i and j,
the bivariate integral could in the extreme case be different.


4. AN ALTERNATIVE PROCEDURE FOR OBTAINING
APPROXIMATE SIGNIFICANCE POINTS


The inequality (3.3) is, of course, valid for any ρ ≥ −1/(k−1).⁴ If we
investigate the behavior of the right hand side of (3.3) for variation in
ρ, it is not difficult to show that the expression is monotone increasing
in ρ² (whether σ² is known or estimated). It immediately follows that a
lower limit, h_L*, computed from (3.3) assuming ρ=0, will fall above the
lower limit, h_L, computed for ρ = −1/(k−1). Thus an approximate
lower limit (for σ² known) can be computed by equating the right hand
side of (3.3), with ρ assumed equal to zero, to (1−α) and solving for h.
This gives

$$\Pr\{A_i\} = \frac{1}{k-1}\left[1 - \sqrt{1 - \frac{2(k-1)\alpha}{k}}\,\right], \tag{4.1}$$

where Pr{A_i} is defined by (3.4a). If σ² is estimated with m d.f. we can
utilize a theorem proved by Halperin in [6] to show that (4.1) is still
an approximate lower limit with Pr{A_i} defined by (3.5a).

³ For infinite d.f. (σ known) the limit of (3.8) was quadratured as described. It is easy to show that
the limiting integral may also be obtained by transformation of the corresponding integral of the
bivariate normal.

⁴ For ρ < −1/(k−1), it is easily shown that the covariance matrix of z_1, ..., z_k becomes indefinite.

Also since (3.3) is valid for ρ=0, we can, from the above discussion,
assert that h computed under the assumption of independence will lie
between h_U and h_L and thus will serve as an approximate significance
point. That is, since for this case

$$\Pr\{\bar{A}_1 \cdots \bar{A}_k\} = [\Pr\{\bar{A}_i\}]^k = 1 - \alpha,$$

we may use

$$\Pr\{A_i\} = 1 - (1 - \alpha)^{1/k}, \tag{4.2}$$

where Pr{A_i} is defined by (3.5a) to compute approximate significance
points.

Some numerical investigation was carried out to explore the closeness
of the approximation (4.2). It was found in all cases that the h obtained
from (4.2) was only slightly below the tabulated upper limit. Thus even
for the extreme case m=k=3, the h given by (4.2) is 3.94 for p=.05
and 6.97 for p=.01 as compared with 3.97 and 6.97 in Tables 1 and 2.
Equation (4.1) was not investigated numerically, but clearly will yield
results close to the lower limit h_L.

Thus, it is suggested that (4.1) and (4.2) be used to obtain
approximations to critical values for m, k, and p not tabulated.
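As an illustration, the approximations (4.1) and (4.2) can be evaluated directly in the σ known case, where (3.4a) is inverted with the normal quantile. The closed form used for (4.1) is the root of 1 − kP + ½k(k−1)P² = 1 − α, which is what equating the right hand side of (3.3) at ρ=0 amounts to; function names are assumptions, and the Student t quantile would replace the normal one when σ is estimated.

```python
import math
from statistics import NormalDist


def h_from_tail(p_single, k):
    """Invert (3.4a): Pr{A_i} = 2 * (1 - Phi(sqrt(k/(k-1)) * h)) = p_single."""
    return math.sqrt((k - 1) / k) * NormalDist().inv_cdf(1 - p_single / 2)


def approximate_points(alpha, k):
    """Upper limit, independence approximation (4.2) and approximate
    lower limit (4.1), all for sigma known."""
    p_upper = alpha / k                                               # left side of (3.3)
    p_indep = 1 - (1 - alpha) ** (1 / k)                              # (4.2)
    p_lower = (1 - math.sqrt(1 - 2 * (k - 1) * alpha / k)) / (k - 1)  # (4.1)
    return {
        "h_U": h_from_tail(p_upper, k),
        "h_42": h_from_tail(p_indep, k),
        "h_41": h_from_tail(p_lower, k),
        "p_lower": p_lower,
    }


pts = approximate_points(0.05, 3)
print({name: round(value, 4) for name, value in pts.items()})
```

The three values are ordered h_41 ≤ h_42 ≤ h_U, in line with the discussion above.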


APPENDIX 


In connection with example 2 of Section 2 we wish to show that the
probability of our composite assertion being wrong in any respect is at
most α whether or not the null hypothesis is true. Such is, in fact, the
case for more general situations than have been considered here and
may be proved, almost trivially, as follows.

Let x_1, x_2, ..., x_k be a set of variates with means μ_1, μ_2, ...,
μ_k, unit variances, and arbitrary correlation matrix. Define h by

$$\Pr\{-h \le (x_i - \mu_i) \le h;\ i = 1, 2, \cdots, k\} = 1 - \alpha. \tag{A.1}$$

Now let the first r of the k means be positive, the next s−r negative,
and the remaining k−s zero. (A.1) may then be rewritten as

$$\Pr\left\{\begin{array}{lll} -h \le x_i - \mu_i \le h; & -h \le x_j - \mu_j \le h; & -h \le x_l \le h \\ i = 1, 2, \cdots, r & j = r+1, \cdots, s & l = s+1, \cdots, k \end{array}\right\} = 1 - \alpha. \tag{A.2}$$

On the other hand the total probability of avoiding error is

$$\Pr\left\{\begin{array}{lll} -h - \mu_i \le x_i - \mu_i < \infty; & -\infty < x_j - \mu_j \le h - \mu_j; & -h \le x_l \le h \\ i = 1, 2, \cdots, r & j = r+1, \cdots, s & l = s+1, \cdots, k \end{array}\right\} \tag{A.3}$$

since when μ_i is positive we can be in error only when x_i ≤ −h and when
μ_j is negative we can be in error only when x_j ≥ h. Since every event
covered in (A.2) is also covered in (A.3), the probability (A.3) is ≥
the probability (A.2). The probability of avoiding error is thus
≥ 1−α and the probability of error in consequence ≤ α.


REFERENCES 


[1] The Biometrika Office, Tables of the Incomplete Beta Function, London,
University College (1948).

[2] Duncan, D. B., “A significance test for differences between ranked
treatments in an analysis of variance,” Virginia Journal of Science, 2 (new series)
(1951), 171.

[3] Dunnett, C. W., and Sobel, M., “A bivariate generalization of Student’s
t-distribution with tables for certain special cases,” Biometrika, 41 (1954),
153-69.

[4] Feller, W., An Introduction to Probability Theory and Its Applications, New
York, John Wiley and Sons (1950), p. 75.

[5] Hartley, H. O., and Pearson, E. S., “Table of the probability integral of the
t-distribution,” Biometrika, 37 (1950), 168-72.

[6] Kimball, A. W., “On dependent tests of significance in the analysis of
variance,” Annals of Mathematical Statistics, 22 (1951), 600-2.

[7] Nair, K. R., “Distribution of the extreme deviate from the sample mean,”
Biometrika, 35 (1948), 118-44.

[8] National Bureau of Standards, “Tables of normal probability functions,”
Applied Mathematical Series, 23 (1953).

[9] Snedecor, G. W., Statistical Methods, 4th edition, Ames, Iowa, The Collegiate
Press (1946), p. 266 and p. 282.

[10] Thompson, W. R., “On a criterion for the rejection of observations and the
distribution of the ratio of deviation to sampling standard deviation,”
Annals of Mathematical Statistics, 6 (1935), 214-19.











ESTIMATION OF THE PARAMETERS OF A SKEWED 
DISTRIBUTION BY LINEAR SYSTEMATIC STATISTICS 


A. E. SARHAN 
University of North Carolina 


1. INTRODUCTION 


IN RECENT literature, linear combinations of the sample ordered values
are used to provide estimates of the parameters of certain distributions
[3, 4, 5, 10]. These estimates and the general class of statistics
which are derived from order statistics are termed systematic by
Mosteller [6]. The efficiencies of these estimates and of some other
linear estimates have been discussed for certain symmetric distributions
[3, 7, 10].

It has been shown, in the case of the rectangular population [5, 10], 
that the two extreme values (with equal numerical weights) provide us 
with the best linear estimate of the mean and standard deviation. In 
Sections 2 and 3 below it is shown that the two extreme sample ele- 
ments will have the largest weights in the best linear estimates of the 
mean and standard deviation of a given skewed distribution with finite 
range. It is also shown that for small samples the midrange and the 
range (which are based on the two extreme sample elements) as esti- 
mates of the mean and standard deviation for the given distribution can 
be used without loss of efficiency. The efficiency of these estimates, and 
of some other linear estimates, is discussed in Sections 4 and 5. 

In making the computations, the variances and covariances of the 
order statistics were first computed as described in Section 2. The 
method of least squares [1, 9] was then used to obtain the best linear 
estimates of the parameters of the distribution. In applying this method 
to problems of this kind, it is convenient to employ matrices and vectors 
to compute the coefficients in the estimating functions and the variance 
of the resulting estimates. The operations are outlined in Sections 6 
and 7. 


2. THE BEST LINEAR ESTIMATES OF THE PARAMETERS OF AN 
ILLUSTRATIVE SKEWED DISTRIBUTION 
Consider the skewed distribution,

$$f(y) = \frac{12}{\theta_2}\left(\frac{y - \theta_1}{\theta_2} + \frac{2}{3}\right)^2 \left(\frac{1}{3} - \frac{y - \theta_1}{\theta_2}\right), \qquad \theta_1 - \frac{2\theta_2}{3} \le y \le \theta_1 + \frac{\theta_2}{3}, \tag{2.1}$$

where θ_1 is the true mode and θ_2 is the true range. Let

$$y_1, y_2, y_3, \cdots, y_n \tag{2.2}$$

denote a sample of size n drawn from this distribution and ordered so
that

$$y_1 \le y_2 \le y_3 \le \cdots \le y_n. \tag{2.3}$$

Now consider the following linear combination of these ordered sample
values:

$$\theta_j^* = \sum_{i=1}^{n} \alpha_{ji}\, y_i, \qquad j = 1, 2, \tag{2.4}$$

where θ_j* is an estimate of the parameter θ_j. The coefficients in (2.4) can
be found by the method of least squares so as to minimize the sampling
variance; i.e., so that θ_j* is the best linear estimate of the jth parameter.


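Let

$$z = \frac{y - \theta_1}{\theta_2} + \frac{2}{3}. \tag{2.5}$$

Then the distribution of z is

$$f(z) = 12 z^2 (1 - z), \qquad 0 \le z \le 1. \tag{2.6}$$

For the rth order statistic from distribution (2.6), the moment of
order k about the origin is defined to be

$$E(z_r^k) = \frac{n!}{(r-1)!\,(n-r)!} \int_0^1 z_r^k f(z_r) \left[\int_0^{z_r} f(z)\, dz\right]^{r-1} \left[\int_{z_r}^{1} f(z)\, dz\right]^{n-r} dz_r, \tag{2.7}$$

where E stands for expected value. The second moment about the true
mean is

$$V(z_r) = E(z_r^2) - [E(z_r)]^2, \tag{2.8}$$

where V stands for variance. The product moment about the origin, for
the order statistics z_r and z_s where z_r < z_s, is defined to be

$$E(z_r z_s) = \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!} \int_0^1 \int_0^{z_s} z_r z_s f(z_r) f(z_s) \left[\int_0^{z_r} f\, dz\right]^{r-1} \left[\int_{z_r}^{z_s} f\, dz\right]^{s-r-1} \left[\int_{z_s}^{1} f\, dz\right]^{n-s} dz_r\, dz_s. \tag{2.9}$$

The product moment about the true means is

$$\operatorname{Cov}(z_r, z_s) = E(z_r z_s) - E(z_r) E(z_s), \tag{2.10}$$

where Cov denotes covariance.
Finally, the relationships between the lower moments of z_r and y_r
are given by

$$E(y_r) = \theta_1 + \theta_2\left[E(z_r) - \frac{2}{3}\right], \tag{2.11}$$

$$E[y_r - E(y_r)]^2 = \theta_2^2 \{E(z_r^2) - [E(z_r)]^2\}, \tag{2.12}$$

$$E[y_r y_s - E(y_r) E(y_s)] = \theta_2^2 [E(z_r z_s) - E(z_r) E(z_s)], \qquad r \neq s. \tag{2.13}$$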


All the expected values and the variances and covariances for order
statistics for samples from distribution (2.6) have been computed for
n as large as 5. Table 1 gives the exact expected values for n=2, 3, 4,
5. Table 2 gives the exact variances and covariances. The coefficients
α_1i and α_2i for the best linear estimates of the mode and the range for
samples up to the size 5 are given in decimal form in Tables 3 and 4,
respectively. These estimates can be computed by substituting these
coefficients in equation (2.4). The variances of θ_1* and θ_2* are shown in
Table 5 in terms of θ_2².


TABLE 1

EXACT EXPECTED VALUES OF THE ORDER STATISTIC z_r IN
SAMPLES FROM THE SKEWED DISTRIBUTION (2.6). EACH VALUE
MUST BE DIVIDED BY THE APPROPRIATE
DIVISOR GIVEN IN THE LAST COLUMN

n    z_1       z_2       z_3       z_4       z_5       Divisor
2    17        25                                      35
3    2127      3039      3843                          5005
4    32799     46239     57087     68079               85085
5    115433    161449    197097    230153    265837    323323
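Because the density (2.6) is a polynomial, the moments (2.7) can be checked in exact rational arithmetic. The sketch below is not the author's computing method; it simply multiplies out the polynomials and integrates term by term. The helper names are assumptions.

```python
from fractions import Fraction
from math import factorial


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_pow(p, e):
    out = [Fraction(1)]
    for _ in range(e):
        out = poly_mul(out, p)
    return out


def integrate01(p):
    """Integral over [0, 1] of a polynomial."""
    return sum(c / (i + 1) for i, c in enumerate(p))


DENSITY = [Fraction(c) for c in (0, 0, 12, -12)]     # f(z) = 12 z^2 - 12 z^3
CDF = [Fraction(c) for c in (0, 0, 0, 4, -3)]        # F(z) = 4 z^3 - 3 z^4
SURVIVAL = [Fraction(c) for c in (1, 0, 0, -4, 3)]   # 1 - F(z)


def order_stat_moment(r, n, power=1):
    """E(z_r^power) for the rth order statistic in a sample of n, as in (2.7)."""
    z_power = [Fraction(0)] * power + [Fraction(1)]
    integrand = poly_mul(poly_mul(z_power, DENSITY),
                         poly_mul(poly_pow(CDF, r - 1), poly_pow(SURVIVAL, n - r)))
    const = Fraction(factorial(n), factorial(r - 1) * factorial(n - r))
    return const * integrate01(integrand)


print([order_stat_moment(r, 2) for r in (1, 2)])
variance_min_of_two = order_stat_moment(1, 2, power=2) - order_stat_moment(1, 2) ** 2
print(variance_min_of_two)
```

The expected values for a given n sum to 3n/5, since the mean of (2.6) is 3/5; this gives a quick check on any row of Table 1.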


3. THE BEST LINEAR ESTIMATE OF THE MEAN AND STANDARD DEVIATION 


The mean of distribution (2.1) is θ_1 − θ_2/15, and its variance is θ_2²/25,
so that estimates of the mean (μ) and standard deviation (σ) can be
computed from the equations
TABLE 2

EXACT VARIANCES AND COVARIANCES OF THE ORDER
STATISTIC z_r IN SAMPLES FROM THE
SKEWED DISTRIBUTION (2.6)

(Entry in row r, column s is Cov(z_r, z_s); diagonal entries are variances.)

n  r   s=1               s=2               s=3               s=4               s=5
2  1   113 d_2           48 d_2
   2                     85 d_2
3  1   621011 d_3        319824 d_3        159264 d_3
   2                     514219 d_3        258048 d_3
   3                                       396501 d_3
4  1   151151299 d_4     83702736 d_4      50774880 d_4      27126528 d_4
   2                     128959619 d_4     78637632 d_4      42148128 d_4
   3                                       109419411 d_4     58948560 d_4
   4                                                         86105899 d_4
5  1   28426707965 d_5   16345757040 d_5   10697825376 d_5   6963908160 d_5    3872558976 d_5
   2                     24580357885 d_5   16151561280 d_5   10539098592 d_5   5870339904 d_5
   3                                       21524077725 d_5   14090885424 d_5   7867132416 d_5
   4                                                         18594827405 d_5   10422379968 d_5
   5                                                                           14844421735 d_5

d_2 = 1/3675, d_3 = 1/25050025, d_4 = 1/7239457225, d_5 = 1/1568066434935


$$\mu^* = \theta_1^* - \theta_2^*/15, \tag{3.1}$$

$$\sigma^* = \theta_2^*/5. \tag{3.2}$$


These estimates are of the linear form


TABLE 3

COEFFICIENTS α_1i IN THE BEST LINEAR ESTIMATE OF THE
MODE (θ_1) IN SAMPLES FROM THE SKEWED
DISTRIBUTION (2.1)

       Order statistic
n      y_1           y_2           y_3           y_4           y_5
2      .208333333    .791666667
3      .232616044    .133288817    .634095139
4      .226185200    .097200059    .113992303    .562622438
5      .21689712     .08170106     .08012333     .10172209     .51955640

θ_1* = Σ_{i=1}^{n} α_1i y_i
$$\mu^* = \sum_{i=1}^{n} \beta_{1i}\, y_i, \tag{3.3}$$

$$\sigma^* = \sum_{i=1}^{n} \beta_{2i}\, y_i. \tag{3.4}$$


The coefficients β_1i and β_2i are given in decimal form in Tables 6 and 7,
respectively. It can be shown that these are the coefficients for the
best linear estimates of μ and σ.


TABLE 4

COEFFICIENTS α_2i IN THE BEST LINEAR ESTIMATE OF THE
RANGE (θ_2) IN SAMPLES FROM THE SKEWED
DISTRIBUTION (2.1)

       Order statistic
n      y_1             y_2            y_3            y_4           y_5
2      -4.375000       4.375000
3      -2.771360652    -.310130814    3.081491422
4      -2.162396264    -.398146805    -.009109328    2.569652397
5      -1.82792882     -.39570250     -.15297402     .09603220     2.2805734

θ_2* = Σ_{i=1}^{n} α_2i y_i

4. EFFICIENCIES OF OTHER LINEAR ESTIMATES OF THE MEAN 
AND STANDARD DEVIATION 


Tables 8 and 9 show the relative efficiencies of other illustrative 
estimates of the mean and standard deviation of distribution (2.1). In 
constructing these tables, the following sample statistics were first 
computed: 


TABLE 5

VARIANCES OF THE ESTIMATES θ_1* AND θ_2* IN SAMPLES FROM
THE SKEWED DISTRIBUTION (2.1), IN TERMS OF θ_2²

n      θ_1*          θ_2*
2      .020138888    .531250000
3      .012478916    .236345228
4      .008943489    .145318720
5      .006914030    .102566451
TABLE 6

COEFFICIENTS β_1i IN THE BEST LINEAR ESTIMATE OF THE
MEAN (μ) OF THE SKEWED DISTRIBUTION (2.1)

       Order statistic
n      y_1           y_2           y_3           y_4           y_5
2      .5            .5
3      .417373421    .153964205    .428662378
4      .370344951    .123743179    .114599591    .391312278
5      .33875904     .10808123     .09032160     .09531994     .36751820

μ* = Σ_{i=1}^{n} β_1i y_i
TABLE 7

COEFFICIENTS β_2i IN THE BEST LINEAR ESTIMATE OF THE
STANDARD DEVIATION (σ) OF THE SKEWED
DISTRIBUTION (2.1)

       Order statistic
n      y_1            y_2            y_3            y_4           y_5
2      -.8750         .8750
3      -.554272130    -.062026163    .616298284
4      -.432479253    -.079629361    -.001821866    .513930479
5      -.36558576     -.07914050     -.03059480     .01920644     .4561145

σ* = Σ_{i=1}^{n} β_2i y_i
TABLE 8

EFFICIENCIES OF OTHER LINEAR ESTIMATES OF
THE MEAN IN SAMPLES FROM THE SKEWED
DISTRIBUTION (2.1)

n      Sample mean μ_1*    Adjusted median μ_2*    Adjusted midrange μ_3*
2      100.00%             100.00%                 100.00%
3      97.41               61.29                   98.59
4      95.91               68.59                   97.61
5      91.36               51.58                   93.05
TABLE 9

EFFICIENCIES OF OTHER LINEAR ESTIMATES OF
STANDARD DEVIATION FROM THE SKEWED
DISTRIBUTION (2.1)

n      Adjusted range σ_1*    Adjusted “normal estimate” σ_2*    Gini’s estimate σ_3*
2      100.00%                100.00%                            100.00%
3      99.57                  99.57                              99.57
4      98.84                  98.56                              97.64
5      97.96                  97.40                              95.51
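$$\text{Mean:}\quad \theta_{11}^* = \frac{1}{n} \sum_{i=1}^{n} y_i. \tag{4.1}$$

$$\text{Median:}\quad \theta_{12}^* = \begin{cases} y_{(n+1)/2} & \text{if } n \text{ is odd,} \\ \tfrac{1}{2}\left(y_{n/2} + y_{n/2+1}\right) & \text{if } n \text{ is even.} \end{cases} \tag{4.2}$$

$$\text{Midrange:}\quad \theta_{13}^* = \tfrac{1}{2}(y_1 + y_n). \tag{4.3}$$

$$\text{Range:}\quad \theta_{21}^* = y_n - y_1. \tag{4.4}$$

$$\text{Normal Estimate:}\quad \theta_{22}^* = \sum_{i=1}^{n} \gamma_i\, y_i. \tag{4.5}$$

$$\text{Gini's Estimate:}\quad \theta_{23}^* = \frac{7}{8} \cdot \frac{2}{n(n-1)} \left[2u - (n+1)v\right], \tag{4.6}$$

where the γ_i's in (4.5) are the coefficients such that θ_22* is the best
linear estimate of the standard deviation of a normal population [3],
and u and v in (4.6) are defined by the equations

$$u = \sum_{i=1}^{n} i\, y_i, \qquad v = \sum_{i=1}^{n} y_i. \tag{4.7}$$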

Unbiased estimates of the mean and standard deviation of distribution
(2.1) were then computed from the statistics.

$$\mu_1^* = \theta_{11}^*, \tag{4.8}$$

$$\mu_2^* = \theta_{12}^* + C_n \theta_{21}^*, \tag{4.9}$$

$$\mu_3^* = \theta_{13}^* - C_n' \theta_{21}^*, \tag{4.10}$$

$$\sigma_1^* = \theta_{21}^* / [E(z_n) - E(z_1)], \tag{4.11}$$

$$\sigma_2^* = K_n \theta_{22}^*, \tag{4.12}$$

$$\sigma_3^* = \theta_{23}^*, \tag{4.13}$$

where C_n, C_n' and K_n are constants which can be calculated for each
value of n to make the estimate unbiased.


5. DISCUSSION 


Tables 3 and 6 show that for estimating the mode and the 
mean, the two extreme sample values should be assigned the greatest 
numerical weights, while the other values should have smaller weights. 
Tables 4 and 7 show a similar situation for estimates of the range and 
standard deviation. 

It is of interest to see that the least sample value (the extreme value 
on the side of the long tail) has a smaller coefficient than the largest 
sample value (the other extreme on the side of the shorter tail). This is 
to be expected, since extreme values from the longer tail occur more 
often and tend to upset the estimate. It throws some light on the effect 
of the shape of the distribution (or its tails) on the coefficients of the 
best linear estimates. The nature of this effect presents an interesting 
but rather difficult problem. 

Tables 8 and 9 show that for distribution (2.1), the efficiency of the 
sample mean decreases more rapidly than the efficiency of the normal 
estimate as the sample size increases, although they are the best linear 
estimates of the mean and standard deviation, respectively, in samples 
from a normal distribution. A similar observation applies to the mid- 
range and the range, although they are both based on the extreme 
sample values. 

Table 8 shows that the efficiency of the midrange is higher than that 
of the sample mean, and that the efficiency of the median is low. So 
the midrange and the sample mean can be used as inefficient estimates 
for estimating the mean of the distribution (2.1), while the median is 
unreliable for such estimation. 

From Table 9, it is clear that the range has a high efficiency in small 
samples from this distribution. The efficiency of the normal estimate 
is nearly the same as the range, but there is no advantage in using it, 
since the range is easier to compute, and is slightly more efficient. Gini’s 
estimate is not as efficient as the other estimates, but it may be useful 
in view of its very simple coefficients and the fact that it is unbiased 
regardless of sample size. 

From this investigation, it is evident that the calculations of the 
optimum coefficients are tedious even for small samples, and will be 
very difficult to work out for larger samples. It seems that the method 
that Jones [4] has recently published will be convenient to estimate the 
mode avoiding tedious calculations. In fact, he used the distribution 











204 AMERICAN STATISTICAL ASSOCIATION JOURNAL, MARCH 1955 


discussed in this article to illustrate his method calculating the co- 
efficients for estimates of the mode for the case where n = 3. The method 
used by Jones deals with estimates of location parameters only and will 
not be satisfactory for distributions that do not have a rounded mode. 

However, after this good start by Jones it is hoped that more in- 
vestigations may lead to a general theory or approximation for estimat- 
ing the parameters of any given distribution. 


6. THE EXTENDED METHOD OF LEAST SQUARES 


When the expected values of observed random variables are linear 
functions of unknown parameters, the principle of least squares pro- 
vides equations the solution of which leads to estimates of the parame- 
ters with certain optimum properties. 

Assume we are given n observations, y_1, y_2, ..., y_n, with expected
values that are linear functions of s<n unknown parameters θ_1,
θ_2, ..., θ_s, so that

$$E(y_r) = \sum_{j=1}^{s} a_{rj}\, \theta_j, \qquad r = 1, 2, \cdots, n, \tag{6.1}$$

where E(y_r) denotes the expected value of y_r. In other words, writing
A for the matrix of n rows and s columns with elements a_rj, and θ for
the column vector with elements θ_j, let us assume that

$$E(y) = A\theta. \tag{6.2}$$

Also, assume that

$$V(y) = V, \tag{6.3}$$

where V(y) is the variance matrix of the vector y, with elements known
apart from a scalar factor. Then under these two conditions, (6.2) and
(6.3), the extended principle of least squares selects the estimates θ_1*,
θ_2*, ..., θ_s* so as to minimize

$$(y - A\theta)' V^{-1} (y - A\theta), \tag{6.4}$$

where (y − Aθ)' is the transpose of (y − Aθ), and V⁻¹ is the inverse of V,
with respect to θ_1, θ_2, ..., θ_s considered as independent variables.
Differentiating with respect to θ_j, equating to zero, and accumulating
for all values of j, we get [9]

$$A' V^{-1} y = (A' V^{-1} A)\, \theta^*; \tag{6.5}$$

that is,

$$\theta^* = (A' V^{-1} A)^{-1} A' V^{-1} y, \tag{6.6}$$

where θ* is the vector of the best linear estimates.
The variance matrix of the estimates of the parameters is given by

$$V(\theta^*) = (A' V^{-1} A)^{-1}. \tag{6.7}$$

Furthermore, given the general solution (6.6), the extended principle
asserts that if new parameters, φ_1, φ_2, ..., φ_s, are introduced by the
equation

$$\phi = M\theta, \tag{6.8}$$

where M is a matrix with s columns, then the least squares estimate of
φ is

$$\phi^* = M\theta^* = M (A' V^{-1} A)^{-1} A' V^{-1} y. \tag{6.9}$$

The matrix of variances and covariances of φ* is evidently

$$M (A' V^{-1} A)^{-1} M'. \tag{6.10}$$
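Formulas (6.6) and (6.7) can be applied directly to the n=3 case of Section 2, with the expected values of Table 1 and the variance matrix of Table 2 as inputs. The sketch below uses exact rational arithmetic and a plain Gauss-Jordan inverse rather than the inversion scheme of Section 7; the helper names are assumptions. The rows of the coefficient matrix should agree with the n=3 lines of Tables 3 and 4.

```python
from fractions import Fraction as Fr


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def inverse(M):
    """Gauss-Jordan inverse in exact rational arithmetic."""
    n = len(M)
    aug = [list(row) + [Fr(int(i == j)) for j in range(n)] for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [x / pivot for x in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


# n = 3: expected values (Table 1) and variance matrix (Table 2) of the z_r.
E = [Fr(2127, 5005), Fr(3039, 5005), Fr(3843, 5005)]
d3 = Fr(1, 25050025)
V = [[621011 * d3, 319824 * d3, 159264 * d3],
     [319824 * d3, 514219 * d3, 258048 * d3],
     [159264 * d3, 258048 * d3, 396501 * d3]]
# By (2.11), E(y_r) = theta_1 + theta_2 * (E(z_r) - 2/3): columns of A.
A = [[Fr(1), e - Fr(2, 3)] for e in E]

Vi = inverse(V)
At = transpose(A)
cov_theta = inverse(matmul(matmul(At, Vi), A))   # (6.7), in units of theta_2^2
coef = matmul(matmul(cov_theta, At), Vi)         # coefficient rows of (6.6)
alpha1 = [float(c) for c in coef[0]]             # mode, compare Table 3
alpha2 = [float(c) for c in coef[1]]             # range, compare Table 4
print(alpha1, alpha2, float(cov_theta[0][0]))
```

Taking M with rows (1, −1/15) and (0, 1/5) in (6.9) would give the coefficients for the mean and standard deviation in the same way.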


7. INVERTING THE VARIANCE MATRIX 


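The major part of the calculation is in inverting the variance matrix
V, with elements v_ij equal to the variances and covariances of the order
statistics. The method used can be summarized as follows:

The matrix V is expressed as the product T'T where T is an n×n
upper-triangular matrix, and T' is its transpose, the elements of T
being such that

$$t_{ii} > 0, \qquad i = 1, 2, \cdots, n, \tag{7.1}$$

$$t_{ij} = 0 \quad \text{if } i > j, \tag{7.2}$$

$$v_{ij} = \sum_{r=1}^{n} t_{ri}\, t_{rj}, \tag{7.3}$$

so that

$$\begin{bmatrix} v_{11} & v_{12} & v_{13} & \cdots & v_{1n} \\ v_{12} & v_{22} & v_{23} & \cdots & v_{2n} \\ \vdots & \vdots & \vdots & & \vdots \\ v_{1n} & v_{2n} & v_{3n} & \cdots & v_{nn} \end{bmatrix} = \begin{bmatrix} t_{11} & 0 & 0 & \cdots & 0 \\ t_{12} & t_{22} & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ t_{1n} & t_{2n} & t_{3n} & \cdots & t_{nn} \end{bmatrix} \begin{bmatrix} t_{11} & t_{12} & t_{13} & \cdots & t_{1n} \\ 0 & t_{22} & t_{23} & \cdots & t_{2n} \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & t_{nn} \end{bmatrix}.$$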
The elements of T are calculated from

$$\begin{aligned} t_{1k} &= v_{1k}/t_{11} && (k = 1, 2, \cdots, n) \\ t_{2k} &= (v_{2k} - t_{12} t_{1k})/t_{22} && (k = 2, 3, \cdots, n) \\ t_{3k} &= (v_{3k} - t_{13} t_{1k} - t_{23} t_{2k})/t_{33} && (k = 3, 4, \cdots, n) \\ &\;\;\vdots \\ t_{nn}^2 &= v_{nn} - t_{1n}^2 - t_{2n}^2 - \cdots - t_{n-1,n}^2. \end{aligned} \tag{7.4}$$

As a check, we can evaluate both sides of the equation

$$\sum_{j=1}^{n} v_{kj} = t_{1k} \sum_{j=1}^{n} t_{1j} + t_{2k} \sum_{j=2}^{n} t_{2j} + \cdots + t_{kk} \sum_{j=k}^{n} t_{kj}. \tag{7.5}$$

The inverse of T is an upper-triangular matrix T⁻¹ whose diagonal elements
are the reciprocals of the corresponding diagonal elements of T.
The reciprocal of V is obtained from

$$T V^{-1} = (T^{-1})'. \tag{7.6}$$

In using (7.6) to calculate V⁻¹, there is no need to find T⁻¹. The number
of different elements in T⁻¹ is ½n(n+1). Of these, ½n(n−1) are zeros,
and n are diagonal elements (each being the reciprocal of the corresponding
diagonal element of T). The solution of ½n(n+1) equations
provides us with values of ½n(n+1) elements of V⁻¹ and the other elements
can be obtained by symmetry.

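The following is a simple example illustrating the method. Suppose
we wish to find the inverse of the symmetric matrix

$$V = \begin{bmatrix} 7 & 0 & 28 & 0 \\ 0 & 28 & 0 & 196 \\ 28 & 0 & 196 & 0 \\ 0 & 196 & 0 & 1588 \end{bmatrix}, \qquad \Sigma_1 = \begin{bmatrix} 35 \\ 224 \\ 224 \\ 1784 \end{bmatrix}.$$

Using (7.4), we get

$$T = \begin{bmatrix} \sqrt{7} & 0 & 4\sqrt{7} & 0 \\ 0 & 2\sqrt{7} & 0 & 14\sqrt{7} \\ 0 & 0 & 2\sqrt{21} & 0 \\ 0 & 0 & 0 & 6\sqrt{6} \end{bmatrix}, \qquad \Sigma_2 = \begin{bmatrix} 5\sqrt{7} \\ 16\sqrt{7} \\ 2\sqrt{21} \\ 6\sqrt{6} \end{bmatrix},$$

where Σ_1 and Σ_2 are the sums of the corresponding rows of V and T
and can be used in (7.5) as a check. By (7.6) we have

$$\begin{bmatrix} \sqrt{7} & 0 & 4\sqrt{7} & 0 \\ 0 & 2\sqrt{7} & 0 & 14\sqrt{7} \\ 0 & 0 & 2\sqrt{21} & 0 \\ 0 & 0 & 0 & 6\sqrt{6} \end{bmatrix} \begin{bmatrix} v^{11} & v^{12} & v^{13} & v^{14} \\ v^{12} & v^{22} & v^{23} & v^{24} \\ v^{13} & v^{23} & v^{33} & v^{34} \\ v^{14} & v^{24} & v^{34} & v^{44} \end{bmatrix} = \begin{bmatrix} 1/t_{11} & 0 & 0 & 0 \\ ? & 1/t_{22} & 0 & 0 \\ ? & ? & 1/t_{33} & 0 \\ ? & ? & ? & 1/t_{44} \end{bmatrix},$$

where the v^{ij} are the elements of V⁻¹, and where the question mark (?)
represents an element in (T⁻¹)' that need not be calculated for finding
the inverse. The last diagonal element in (T⁻¹)' is equal to the last row
of T multiplied by the last column in V⁻¹. Thus,

$$\frac{1}{6\sqrt{6}} = 6\sqrt{6}\, v^{44};$$

that is,

$$v^{44} = \frac{1}{216}.$$

Proceeding up, we have ten equations in ten unknowns whose solution
provides us with 10 elements of the inverse. The other elements can be
obtained by symmetry.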

The equations are in general given by

$$v^{ij} = \begin{cases} -\dfrac{\sum_{r=i+1}^{n} t_{ir}\, v^{rj}}{t_{ii}} & \text{for } i \neq j, \\[2ex] \dfrac{\dfrac{1}{t_{ii}} - \sum_{r=i+1}^{n} t_{ir}\, v^{ri}}{t_{ii}} & \text{for } i = j. \end{cases} \tag{7.7}$$

The inverse of the matrix in the example, therefore, is

$$V^{-1} = \begin{bmatrix} \dfrac{1}{3} & 0 & -\dfrac{1}{21} & 0 \\[1.5ex] 0 & \dfrac{397}{1512} & 0 & -\dfrac{7}{216} \\[1.5ex] -\dfrac{1}{21} & 0 & \dfrac{1}{84} & 0 \\[1.5ex] 0 & -\dfrac{7}{216} & 0 & \dfrac{1}{216} \end{bmatrix}.$$

The method here described is a modification of the standard square
root method [2] and involves less calculation (since the non-zero
off-diagonal elements of T⁻¹ are not required). It also leads to computed
values of the elements in the inverse that are usually more exact.
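The two stages of the method, the factorization (7.4) and the back-solution (7.7), can be sketched as below and run on the 4×4 example. This is an illustrative floating-point version, not the author's desk-calculator layout; function names are assumptions.

```python
import math


def upper_factor(V):
    """Upper-triangular T with V = T'T, computed row by row as in (7.4)."""
    n = len(V)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i, n):
            s = V[i][k] - sum(T[r][i] * T[r][k] for r in range(i))
            T[i][k] = math.sqrt(s) if k == i else s / T[i][i]
    return T


def inverse_from_factor(T):
    """Elements of V^{-1} from T alone, using (7.7); T^{-1} is never formed."""
    n = len(T)
    W = [[0.0] * n for _ in range(n)]
    for j in range(n - 1, -1, -1):          # last column first
        for i in range(j, -1, -1):          # proceed up the column
            s = sum(T[i][r] * W[r][j] for r in range(i + 1, n))
            rhs = 1.0 / T[i][i] if i == j else 0.0
            W[i][j] = (rhs - s) / T[i][i]
            W[j][i] = W[i][j]               # fill the other half by symmetry
    return W


V = [[7, 0, 28, 0],
     [0, 28, 0, 196],
     [28, 0, 196, 0],
     [0, 196, 0, 1588]]
T = upper_factor(V)
V_inv = inverse_from_factor(T)
for row in V_inv:
    print([round(x, 6) for x in row])
```

The printed rows can be compared with the fractions 1/3, −1/21, 397/1512, −7/216, 1/84 and 1/216 of the example.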

The author would like to express his acknowledgment to Dr. B. G. 
Greenberg for his help and supervision. 
A FAMILY OF J-SHAPED FREQUENCY FUNCTIONS


CHESTER W. TOPP
AND
1. INTRODUCTION


THE usual procedure in obtaining theoretical distributions is to obtain
a statement of the frequency function, f(x). The distribution
function, F(x), may be expressed as a definite integral. One may adopt
the opposite viewpoint and obtain a distribution function, F(x), from
which the associated frequency function, f(x), may be obtained by
differentiation. F(x) will be referred to as a cumulative frequency function
(c.f.f.). The latter viewpoint was discussed by Burr in [1].

It is the purpose of this paper to give a family of c.f.f.’s which is
believed to be new, which is readily handled from the calculational
standpoint, which has useful values of α_3 and α_4, the standard third and
fourth moments, including the range of values for failure data mentioned
below, and which yields J-shaped frequency functions. A graph
is presented which shows the values of α_3² and

$$\delta = \frac{2\alpha_4 - 3\alpha_3^2 - 6}{\alpha_4 + 3}$$

assumed by members of the family, and some examples are given which
show that a satisfactory fit of empirical data may be obtained by the
use of these functions.

A collection of some sixty sets of empirical data with J-shaped histograms
reveals several distributions with α_3 and α_4 values which fall
under Pearson’s Type I classification. Of these, there is a definite group
dealing with failures such as frequency of powered band tool failures,
frequency of automatic calculating machine failures, and frequency of
failure at time x of radar components. These sets of failure data have
α_3 values ranging from .57 to 1.66 and δ values ranging from −.47 to
−.58. There appears to be a justification, therefore, for an examination
of theoretical curves which yield such values.





2, A FAMILY OF CUMULATIVE FREQUENCY FUNCTIONS 


The family of c.f.f.’s considered here is defined as follows: 
209 
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r _ a ~ x 
(=) = 5 te — 2%) +0-9— 


() F(x) = 0, x<0, 


F(z) = 1, x > b, 


where 0<r<1 and 0<aSl. 
The corresponding family of frequency functions obtained by differ- 
entiation of (1) is given by 
l-a 


2ar 
= (6 — x)(2br — x*)-* + . 


(2) f(x) = ry 5 


Two further differentiations yield the functions 


2ar 
(3) f'@) = = be — aty-*[(b — 2)*@r — 1) — BF, 


4ar 
(4) f(x) = ms (b — x)(r — 1)(2ba — ax?)"-3[(b — x)?(2r —1) -— 3b?]. 


From (2) it is seen that f(x) is positive for 0<x<b; from (3) f′(x) is
seen to be negative for all values 0<x<b; and from (4) f″(x) is seen to
be positive for all x such that 0<x<b. Thus all frequency functions of
the family (2) have graphs which are positive between 0 and b, have
negative slopes between 0 and b, and for each of them the slope is
always increasing between 0 and b. These curves thus may be called
J-shaped. From (2), however, it can be seen that the frequency curves
of the family do not drop to zero at the right extreme of range except in
the case a=1, since the ordinate at x=b is (1−a)/b. Some curves of this
type, having a positive ordinate at the right extreme of range, are
called U-shaped by Pearson.
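The sign conditions read off from (2), (3) and (4) can be spot-checked numerically. The following is a minimal Python sketch and not part of the original derivation; the function names and the illustrative parameter values a = .91, r = .40, b = 1 are assumptions made here.

```python
def cff(x, a, r, b=1.0):
    """Cumulative frequency function (1)."""
    if x < 0:
        return 0.0
    if x > b:
        return 1.0
    return a / b ** (2 * r) * (2 * b * x - x * x) ** r + (1 - a) * x / b


def density(x, a, r, b=1.0):
    """Frequency function (2), for 0 < x <= b."""
    return 2 * a * r / b ** (2 * r) * (b - x) * (2 * b * x - x * x) ** (r - 1) + (1 - a) / b


def slope(x, a, r, b=1.0):
    """First derivative (3) of the frequency function, for 0 < x < b."""
    bracket = (b - x) ** 2 * (2 * r - 1) - b * b
    return 2 * a * r / b ** (2 * r) * (2 * b * x - x * x) ** (r - 2) * bracket


def curvature(x, a, r, b=1.0):
    """Second derivative (4) of the frequency function, for 0 < x < b."""
    bracket = (b - x) ** 2 * (2 * r - 1) - 3 * b * b
    return (4 * a * r / b ** (2 * r) * (b - x) * (r - 1)
            * (2 * b * x - x * x) ** (r - 3) * bracket)


a, r, b = 0.91, 0.40, 1.0                      # illustrative values only
grid = [b * i / 100 for i in range(1, 100)]    # interior points 0 < x < b

positive = all(density(x, a, r, b) > 0 for x in grid)
decreasing = all(slope(x, a, r, b) < 0 for x in grid)
slope_increasing = all(curvature(x, a, r, b) > 0 for x in grid)
right_ordinate = density(b, a, r, b)           # equals (1 - a) / b
```

A grid check of this kind only samples the interval; the argument in the text covers every x between 0 and b.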

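Cumulative moments $M_k$ are given by

$$M_k = \int_0^{b} x^k\,[1 - F(x)]\,dx, \tag{5}$$

where the c.f.f. F(x) is used. In addition,

$$
\begin{aligned}
\mu_1' &= M_0,\\
\mu_2 &= 2M_1 - M_0^2,\\
\mu_3 &= 3M_2 - 6M_1M_0 + 2M_0^3,\\
\mu_4 &= 4M_3 - 12M_2M_0 + 12M_1M_0^2 - 3M_0^4,
\end{aligned}
\tag{6}
$$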
where the $\mu$’s are the ordinary moments about the mean for the
frequency function f(x).

Using (5) and some straightforward integration, one may obtain for
this family of functions

$$M_k = b^{k+1}\left[aN_k + \frac{1 - a}{(k+1)(k+2)}\right], \tag{7}$$


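where the $N_k$ are the cumulative moments for those members of the
family (1) having a=1 and the range from zero to one; that is, for the
c.f.f.’s

$$F(x) = (2x - x^2)^{r}, \qquad 0 \le x \le 1.$$

The $N_k$ are calculated from (5) and are given by

$$
\begin{aligned}
N_0 &= 1 - R,\\
N_1 &= \frac{1}{2} + \frac{1}{2(r+1)} - R,\\
N_2 &= \frac{1}{3} + \frac{1}{r+1} - 2R + W,\\
N_3 &= \frac{1}{4} + \frac{2}{r+1} - \frac{1}{2(r+2)} - 4R + 3W,
\end{aligned}
$$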


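where

$$R = \frac{\pi\,\Gamma(2r+2)}{2^{2r+2}\,\Gamma^2\!\left(r + \tfrac{3}{2}\right)}, \qquad
W = \frac{\pi\,\Gamma(2r+4)}{2^{2r+4}\,\Gamma^2\!\left(r + \tfrac{5}{2}\right)} = \frac{2r+2}{2r+3}\,R.$$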
Since $M_k$ contains the factor $b^{k+1}$ and no other terms containing b, it
can be seen from (6) that $\alpha_3$ and $\alpha_4$ are independent of b. However,

$$\mu_1' = b\left(\frac{a+1}{2} - aR\right), \tag{8}$$

$$\sigma = \left\{ b^2\left[\frac{1+2a}{3} + \frac{a}{r+1} - 2aR\right] - (\mu_1')^2 \right\}^{1/2}. \tag{9}$$

The standard moments $\alpha_3$ and $\alpha_4$ thus depend on the two parameters a
and r and may be computed from the following formulas with the aid
of (7):

$$\alpha_3 = \frac{3M_2 - 6M_1M_0 + 2M_0^3}{(2M_1 - M_0^2)^{3/2}}, \tag{10}$$

$$\alpha_4 = \frac{4M_3 - 12M_2M_0 + 12M_1M_0^2 - 3M_0^4}{(2M_1 - M_0^2)^{2}}. \tag{11}$$
Following Craig [2], values of $\alpha_3^2$ and

$$\delta = \frac{2\alpha_4 - 3\alpha_3^2 - 6}{\alpha_4 + 3}$$

are used as coordinates in order to plot the members of the family of
c.f.f.’s given by (1), thus affording a comparison with the Pearson
system of frequency curves. The $\alpha_3^2$ and $\delta$ values assumed by members of
the family are given in Figure 1. This is a graph of $\alpha_3^2$ and $\delta$ versus a
and r. It covers a portion of the original Craig chart. If a=r=1, the
frequency function is triangular and results in the minimum $\delta = -.40$.
An upper bound for $\delta$ may be found by taking a=1 and letting r
approach 0. This yields a $\delta$ value of approximately $-.81$. It might be
pointed out here that each point on the chart corresponds to an infinite
number of members of the family (1) obtained by varying b. For a
given empirical distribution, after the sample $\alpha_3^2$ and $\delta$ are calculated,
an a and r can be selected from the contour lines in the graph. A more
complete comparison between $\alpha_3^2$ and $\delta$ and a and r is presented in
Table 1. A linear interpolation may be used between values in the table.
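The chain from (7) through (11) to $\alpha_3^2$ and $\delta$ can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are assumptions, and the closed forms for $N_0,\dots,N_3$ used below were obtained by carrying out the integration in (5) for the case a=1, b=1, so they should be compared with the printed formulas before being relied upon.

```python
import math


def r_and_w(r):
    """R and W as gamma-function expressions; W = (2r+2)/(2r+3) * R."""
    R = math.pi * math.gamma(2 * r + 2) / (2 ** (2 * r + 2) * math.gamma(r + 1.5) ** 2)
    W = (2 * r + 2) / (2 * r + 3) * R
    return R, W


def cumulative_moments(a, r, b=1.0):
    """M_0..M_3 from (7), using closed forms for N_0..N_3 (assumed, see lead-in)."""
    R, W = r_and_w(r)
    N = [
        1 - R,
        0.5 + 1 / (2 * (r + 1)) - R,
        1 / 3 + 1 / (r + 1) - 2 * R + W,
        0.25 + 2 / (r + 1) - 1 / (2 * (r + 2)) - 4 * R + 3 * W,
    ]
    return [b ** (k + 1) * (a * N[k] + (1 - a) / ((k + 1) * (k + 2))) for k in range(4)]


def alpha3_sq_and_delta(a, r):
    """Chart coordinates (alpha_3^2, delta) from (6), (10), (11); independent of b."""
    M0, M1, M2, M3 = cumulative_moments(a, r)
    var = 2 * M1 - M0 ** 2
    mu3 = 3 * M2 - 6 * M1 * M0 + 2 * M0 ** 3
    mu4 = 4 * M3 - 12 * M2 * M0 + 12 * M1 * M0 ** 2 - 3 * M0 ** 4
    alpha3 = mu3 / var ** 1.5
    alpha4 = mu4 / var ** 2
    delta = (2 * alpha4 - 3 * alpha3 ** 2 - 6) / (alpha4 + 3)
    return alpha3 ** 2, delta


# Triangular case a = r = 1: the text gives delta = -.40
a3sq, delta = alpha3_sq_and_delta(1.0, 1.0)
```

For a = r = 1 the sketch returns $\alpha_3^2 = .32$ and $\delta = -.40$, in agreement with the value quoted for the triangular frequency function.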










3. FITTING EMPIRICAL DATA 


The method of fitting a theoretical c.f.f. to a given set of data has
been described by Burr in [1] and by Hatke in [5]. For the family of
c.f.f.’s given by (1), a brief description of the method is given here.
Consider only those members of the family where b=1, i.e. consider the
family

$$
\begin{aligned}
F(x) &= a(2x - x^2)^{r} + (1 - a)x, && 0 \le x \le 1,\\
F(x) &= 0, && x < 0,\\
F(x) &= 1, && x > 1.
\end{aligned}
\tag{12}
$$

Calculate $\alpha_3$ and $\alpha_4$ for the given set of data and use Figure 1 or Table 1
to obtain values of a and r to use in (12). For this chosen c.f.f., calculate
$\mu_1'$ and $\sigma$ by use of (8) and (9). Change the scale and origin of the values
of the variable X of the given data to those x’s corresponding to (12)
by means of

$$\frac{x - \mu_1'}{\sigma} = \frac{X - \bar{X}}{s}, \tag{13}$$

where $\bar{X}$ and s are the mean and standard deviation of the given data.
The x values to use in (12) are thus seen to be

$$x = \frac{\sigma}{s}\,X + \left(\mu_1' - \frac{\sigma}{s}\,\bar{X}\right). \tag{14}$$

The X class limits are substituted into (14) and corresponding values
of (12) are calculated and differenced to give the probabilities for the
given ranges of X. These probabilities may be multiplied by the total
frequency to yield theoretical frequencies.

TABLE 1
VALUES OF $\alpha_3^2$, $\delta$ FOR a, r; $F(x) = a(2x - x^2)^{r} + (1 - a)x$
[Table entries are not legible in the source.]

An example of the fitting process is given here using Davis’ data [3]
on the lifetime in hours of transmitter tubes used in aircraft radar
sets.¹

For this distribution $\bar{X} = 149.480$, $s = 139.539$, $\alpha_3^2 = 1.587$ and
$\delta = -.505$ are obtained empirically. The graduation was done by using
r=.40 and a=.91 in (12). These values were read from Figure 1 which
was entered with the values of $\alpha_3^2$ and $\delta$. Thus, the c.f.f. used for
graduation was

$$F(x) = .91(2x - x^2)^{.40} + .09x. \tag{15}$$

By using equations (8) and (9) and b=1 we get

$$R = \frac{\pi\,\Gamma(2.8)}{2^{2.8}\,\Gamma^2(1.9)} = .817578,$$

$$\mu_1' = \frac{a+1}{2} - aR = .211005,$$

$$\sigma = \left[\frac{1+2a}{3} + \frac{a}{r+1} - 2aR - (\mu_1')^2\right]^{1/2} = .239776.$$

Substitution in equation (14) gives us

$$x = .0017183X - .045844$$





1 Davis presented thirty distributions. Approximately fifteen distributions were tested. Half of
these fell into the range of this J-shape, and better than half of these proved to have a good fit by this
distribution.

TABLE 2
HOURS BEFORE FAILURE OF TRANSMITTER TUBES

Lifetime    Class     Class              Graduated    Graduated          Observed
in hours    limits    limits    F(x)     frequency    frequency by       frequency
X           in X      in x               by F(x)      exponential [3]

   25                                      112.0          94.7
   75                                       63.9          68.1              68
                                            38.5          50.2              48
                                            28.3          34.0              31
                                            40.2          43.8              42
                                            26.7          22.2              21
  500                                       27.3          24.0              27

Total                                      336.9         337.0             337

into which one may substitute the X class limits 50, 100, 150, etc. The
values of x obtained are substituted into (15) to calculate the F(x)
values which when multiplied by 337 and differenced give the
theoretical frequencies found in column 5 of Table 2.
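The arithmetic of this example can be retraced with a short Python sketch. It is offered only as an illustration under stated assumptions: the variable and function names are invented here, the constant R is taken in its gamma-function form for r = .40, and the lowest class is assumed to start at X = 0 (only the limits 50, 100, 150, ... are named in the text).

```python
import math

a, r = 0.91, 0.40                  # read from Figure 1 in the text
mean_X, s = 149.480, 139.539       # sample mean and standard deviation
total = 337                        # total observed frequency

R = math.pi * math.gamma(2 * r + 2) / (2 ** (2 * r + 2) * math.gamma(r + 1.5) ** 2)
mu = (a + 1) / 2 - a * R                                                # (8), b = 1
sigma = math.sqrt((1 + 2 * a) / 3 + a / (r + 1) - 2 * a * R - mu ** 2)  # (9), b = 1


def to_x(X):
    """Change of scale and origin (14)."""
    return sigma / s * X + (mu - sigma / s * mean_X)


def cff(x):
    """The c.f.f. (15), that is (12) with a = .91 and r = .40."""
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return a * (2 * x - x * x) ** r + (1 - a) * x


limits = [0, 50, 100]              # lower limit 0 is an assumption
cumulative = [total * cff(to_x(X)) for X in limits]
frequencies = [hi - lo for lo, hi in zip(cumulative, cumulative[1:])]
```

With these inputs the first two graduated frequencies come out close to the 112.0 and 63.9 shown in Table 2; small differences in the last digit arise from rounding of the intermediate constants.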

If the $\chi^2$ comparison test is used, the fit obtained by the use of F(x) is
good. One obtains $\chi^2 = 5.499$ whereas $P(\chi^2 > 4.878) = .30$ and
$P(\chi^2 > 5.989) = .20$ for 4 degrees of freedom. The corresponding figures for
the exponential graduation as given in [3] are $\chi^2 = 1.13$ and $P = .95$ for
5 degrees of freedom.


4. SOME EXAMPLES 


In this section a few examples are given to show the satisfactory fit
which may be obtained with F(x). The first example is taken from [3]
and concerns the lifetime in hours for 100 V600 indicator tubes used in
aircraft radar sets.² The c.f.f. used for this graduation was





2 The value, X=1000, in Table 3 differs from that indicated in Davis’ original paper in which
the highest interval was left open. The actual data were obtained from the author and this last interval
is closed.

TABLE 3
HOURS BEFORE FAILURE OF V600 INDICATOR TUBE

Lifetime in    Observed      Graduated       Graduated
hours X        frequency     frequency by    frequency by
                             F(x)            exponential [3]

    50            29
   150            22
   250            12
   350            10
   500            10
   700             9
  1000             8

Total            100

$$F(x) = .83(2x - x^2)^{.36} + .17x$$

and the fit may be measured by $\chi^2 = 1.649$, giving $P = .80$ for 4 degrees
of freedom. The exponential graduation gave $\chi^2 = 2.48$, $P \approx .78$ for 5
degrees of freedom.


The last example is taken from the data given by Elderton in [4]. 
These data relate to six months’ experience of maturities among en- 
dowment assurances and are graduated by Elderton with a Pearson 
Type VIII curve. 


TABLE 4
MATURITIES AMONG ENDOWMENT ASSURANCES

Observed      Graduated         Graduated
frequency     frequency by      frequency by
              F(x)              Type VIII [4]

   469           468.1              437
   186           193.4              222
   166           153.4              165
   134           137.3              136
   122           127.3              120
   112           109.5              109

  1189          1189.0             1189

The c.f.f. used for the graduation was

$$F(x) = .46(2x - x^2)^{.20} + .54x$$

and yields $\chi^2 = 1.677$, whereas $P(\chi^2 > 1.424) = .70$ and $P(\chi^2 > 2.366) = .50$
for 3 degrees of freedom.
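The $\chi^2$ figure quoted for Table 4 can be recomputed directly from the observed frequencies and the frequencies graduated by F(x). The short Python sketch below uses the usual sum of (observed − graduated)² / graduated; the variable names are illustrative.

```python
observed = [469, 186, 166, 134, 122, 112]
graduated = [468.1, 193.4, 153.4, 137.3, 127.3, 109.5]   # by F(x), Table 4

# Sum of squared discrepancies, each scaled by the graduated frequency
chi_square = sum((o - g) ** 2 / g for o, g in zip(observed, graduated))
```

The result rounds to 1.677, the value given in the text. The stated 3 degrees of freedom would be consistent with six classes, less one for the total and less two for the fitted parameters a and r, although the text does not spell this out.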
BASIC PROBLEMS, TECHNIQUES, AND THEORY
OF ISOPLETH MAPPING* 


Calvin F. Schmid and Earle H. MacCannell
Office of Population Research, University of Washington


This paper is concerned with the basic problems, techniques and
theory of isopleth mapping. The term “isopleth,” Gr. isoplethes,
equal in quantity or number (from isos, equal, and plethos, quantity,
number), as used in this discussion, designates one of two types of
isoline maps in which the lines (isopleths) connect equal rates or ratios
for specific areas. In the other type of isoline map, which is commonly
referred to as an isometric map, lines (isometers) are drawn through
points of equal value or intensity. In the “isopleth” map, the values are
rates or ratios computed for areal units, such as census tracts,
townships, precincts, or counties, whereas in the “isometric” map the
values are samples of absolute measurement taken at different points on a
map. A population density map with lines showing equal densities is an
example of the “isopleth” map, and a topographic map with lines
connecting a series of points of equal elevation (“isohypses,” or more
commonly, contour lines) is an example of the “isometric” map.¹
Although there have been several important contributions to the
literature on problems of isopleth mapping, all have failed in varying
degrees (1) to state certain of the problems definitively and with
reference to basic statistical principles; (2) to offer adequate solutions
to the stated problems, and (3) to formulate clearly the relationship
between statistical theory and isopleth mapping. Obviously, the present
paper makes no pretense of presenting the final word on these three
desiderata, but it does attempt to make more clear and explicit certain
of the basic problems, solutions, and statistical theory of isopleth
mapping.²





* Acknowledgment is made to Lloyd Kirry and Gloria Austin for their assistance in drafting the 
charts. The authors also are indebted to their colleagues, Z. William Birnbaum and Douglas G. Chap- 
man of the Department of Mathematics, John Sherman of the Department of Geography, and Stuart C. 
Dodd and George A. Lundberg of the Department of Sociology for a critical reading of the manuscript. 

1 It will be found that the terminology used in connection with both “isopleth” and “isometric” 
maps as designated above, is inconsistent, if not confusing. For example, instead of the generic term, 
“isoline” many cartographers prefer “isogram” (equal-line), “isopleth” (equal-measure), “isorithm” 
(equal-number), and a few, “isometric” and “isontic.” 

2 Perhaps the most relevant and significant discussions in this connection are Fr. Uhorczak,
“Metoda Isarytmiczna W Mapach,” Polski Przeglad Kartograficzny Tom IV (Grudzien 1929), 95-124;
and J. Ross Mackay, “Some problems and techniques in isopleth mapping,” Economic Geography,
XXVII (January, 1951), 1-9.

SUMMARY OF PROBLEMS OF ISOPLETH MAPPING


The major problems of isopleth mapping may be summarized as
follows:

1. What influence does the size of base areas have on isopleths? Is
   there an “optimum-sized” base area for isopleth maps?
2. What is the most logical “control point” and how is it located?
3. What is the logic and technique of determining class-intervals for
   isopleths?
4. What interpolation technique is most appropriate for isopleth
   maps? What factors influence the reliability of interpolation?
5. Are there any principles derivable from statistical theory that
   might be useful in the formulation of a sound rationale of isopleth
   mapping as well as in providing a basis for clarifying and perhaps
   solving some of the problems discussed in this paper?


IMPLICATIONS OF SIZE OF BASE AREA 


Since the data for isopleth maps are based on defined areas, their
size and shape exert a pronounced influence on the reliability,
comparability, significance, and general appearance of the isopleth map.
If the base areas are relatively large, meaningful variations are masked
and the isopleths are extremely general. On the other hand, if the areas
are relatively small, chance and possibly meaningless variations in the
data will be recorded as myriads of tiny “islands” or “peaks” on the
isopleth map.³

Sometimes the cartographer has a limited choice of areas since the
data may be classified in two, three, or even four ways. For example,
data for a city may be classified on the basis of blocks, enumeration
districts, census tracts or community areas, and for a state, into
enumeration districts or precincts, census divisions or townships, and
counties. The size of the territorial units in relation to the entire area to
be mapped, the nature of the data, and particularly the purpose at
hand should be the major determinants in selecting a particular type of
area. For example, the authors have constructed more than 15 isopleth
maps for the City of Chicago (approximately 207 square miles of land
area) based on approximately 1,000 census tracts in 1940. If block
data or enumeration district data had been used, there would have been
many thousands of little islands which would have resulted in excessive
and unnecessary detail. On the other hand, community areas with their
relatively large size and heterogeneity would have concealed important
and significant patterns.

3 These “islands” or “peaks” may be meaningless in two ways: First, if the variation within the
blocks, or other small areas, is not significantly less than the variation between these areas and some
larger areas, such as enumeration districts, the variations which are recorded on the basis of the small
areas are not statistically significant. Second, although the variations indicated by small areas are
statistically significant, it may be unnecessary or even confusing for many purposes to use the smallest
significant base areas.

Some statistical technique, such as analysis of variance, will be found useful in determining the
minimum significant size of base areas, even though it may not be desirable to use such small areas in
every case. The base areas of blocks, enumeration districts and census tracts used in Figures 1, 2 and 3
were analyzed by analysis of variance techniques. The variation within about 80 per cent of the blocks
was found to be significantly less than the variation between the blocks within their respective
enumeration districts. The variation within all but one of the enumeration districts was found to be
significantly less than the variation between enumeration districts within their respective census tracts.
The variation within the census tracts was found to be significantly less than the variation for the whole
segment included on the maps. The criterion of significance as used here was defined at the five per cent
level.

A comparison of three isopleth maps for the northern and eastern
segment of Seattle, based on (1) city blocks, (2) enumeration districts,
and (3) census tracts respectively, is shown in Figures 1, 2, and 3. The
map comprises approximately 20 square miles of land area out of a
total land area of 70.8 square miles for the City of Seattle in 1950. In
the area covered by these maps there are approximately 2,400 blocks,
290 enumeration districts, and 36 census tracts. After carefully
examining Figures 1, 2, and 3 it is difficult, if not impossible, to say
which is “best,” since any judgment of this kind is usually related to the
purpose for which the map is designed. However, for certain purposes it
can be said that the isopleths based on block data provide too much
detail, while those based on census tract data are overgeneralized. In
this connection, the size of the base unit is always relative to the
over-all area included on the map.

It must not be overlooked that the problem of size and shape of the 
areal units has another implication with respect to isopleth mapping. 
When the base areas vary widely in size a loss in measurability of values 
occurs in the interpolation process. If one control point is based on a 
relatively large area, and the other on a small area, the interpolation of 
values between them is subject to errors of interpretation which cannot 
be controlled statistically. If it is impossible or impracticable to use 
base areas of approximately the same size and shape, extreme caution 
should be followed in interpreting the resulting isopleths. 


LOCATION OF CONTROL POINT 


In constructing an isopleth map a point is used to represent each
areal unit. This point, which is called the control point of the area, must
be accurately located according to some specific assumption for each
areal unit for which a rate or ratio has been derived. If the unit is
symmetrical in shape and the distribution of the phenomenon is relatively
uniform, the geographical center naturally would be the control point.

[Figures 1, 2, and 3. ISOPLETH MAP BASED ON BLOCK DATA; ISOPLETH MAP
BASED ON ENUMERATION DISTRICT DATA; ISOPLETH MAP BASED ON CENSUS TRACT
DATA. MEAN VALUE OF ONE-DWELLING-UNIT STRUCTURES, NORTHERN AND EASTERN
SEGMENT OF SEATTLE: 1950]

In actuality, census tracts, precincts, counties, and other areas for
which statistics are compiled are seldom symmetrical, and the patterns
of distribution of most economic, social, and other phenomena are
uneven. The problem is to determine the most typical or representative
point in each area. This point in any type of area, with even or uneven
distribution of some characteristic, is the center of gravity or pivot
point where the distribution would balance if it were supported by a
rigid and weightless plane.⁴

Occasionally, in unusually shaped areas and uneven distributions the
control point may be located outside the boundaries of the area.
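The center-of-gravity definition of the control point amounts to a weighted mean of coordinates. The Python sketch below illustrates this for a handful of point masses; the function name, the coordinates and the weights are hypothetical and serve only to show the calculation.

```python
def control_point(elements):
    """Center of gravity of (x, y, weight) elements: the weighted mean position."""
    total = sum(w for _, _, w in elements)
    cx = sum(x * w for x, _, w in elements) / total
    cy = sum(y * w for _, y, w in elements) / total
    return cx, cy


# Hypothetical areal unit: three clusters of dwellings at (x, y) with their counts
dwellings = [(0.0, 0.0, 10), (0.0, 4.0, 10), (4.0, 0.0, 20)]
center = control_point(dwellings)   # (2.0, 1.0)
```

Because the result is pulled toward the heavier cluster, the control point generally differs from the geographical center whenever the distribution within the unit is uneven.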


DETERMINING SIZE OF INTERVALS 


The third consideration is the choice of intervals for isopleths. The
isopleth interval may be based on either a geometric or arithmetic
progression or some division or multiple of a number such as 5 or 10,
resulting, for example, in line values 0, 5, 10, 25, 50, 100, 500, or 1,000.
The size of the interval should be adapted to each map, particularly in
relation to its purpose as well as to the type of distribution, reliability,
and other characteristics of the data. If the intervals are too large, the
results may be an overgeneralized and somewhat meaningless map. On
the other hand, if the isopleths are plotted in accordance with small
class intervals or when a map has widely separated control points, an
unwarranted impression of precision is conveyed. The value of each
isoline is indicated on the map with an appropriate number and/or by a
hatching scheme, the significance of which is included in a
supplementary legend.⁵
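The two progressions mentioned above differ only in whether successive line values are obtained by adding a constant or by multiplying by one. A brief Python sketch, with illustrative function names and starting values that are assumptions rather than recommendations:

```python
def arithmetic_levels(start, step, count):
    """Isopleth line values in arithmetic progression."""
    return [start + step * i for i in range(count)]


def geometric_levels(start, ratio, count):
    """Isopleth line values in geometric progression (start must be non-zero)."""
    return [start * ratio ** i for i in range(count)]


arithmetic = arithmetic_levels(0, 5, 5)    # [0, 5, 10, 15, 20]
geometric = geometric_levels(5, 2, 5)      # [5, 10, 20, 40, 80]
```

A geometric progression spaces the lines more widely at high values, which may suit strongly skewed data; which scheme is appropriate depends on the purpose and the data, as the text notes.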


INTERPOLATION PROBLEMS AND TECHNIQUES 


One of the most frequently discussed problems in isopleth mapping
pertains to interpolation procedure. Although more or less complicated
mathematical interpolation techniques have been proposed,⁶ simple
linear interpolation seems to be most appropriate as judged by
reliability of results, to say nothing of the additional time required for
the more complicated techniques.⁷

4 In mathematical terminology this point is known as the centroid of the base area. “Centroid”
may be defined as the point at which an area must be supported in order to balance perfectly if the
area itself is a weightless plane, and the elements described by the centroid are distributed over it. In
physics this point is known as the “first moment” or “center of gravity” and may be found
mathematically by locating the mean distance of all the elements from any pair of arbitrarily chosen
perpendicular axes. Cf., Mackay, op. cit. and Uhorczak, op. cit.

5 J. W. Alexander and G. A. Zahorchak, “Population-density maps of the United States: techniques
and patterns,” Geographical Review, 33 (1943), 458-60; J. Ross Mackay, op. cit.

6 Uhorczak devotes most of his paper (op. cit.) to the problem of interpolation from a non-linear
point of view. He applies logarithmic and geometric techniques in order to maintain uniformity of
distribution within the areal subdivisions of a map. Uhorczak assumes that the maintenance of the
uniformity of distribution within the base areas is the major consideration in constructing a statistically
measurable isopleth map. The present writers disagree with this assumption. Rather than uniformity of
distribution it can be demonstrated empirically that values of most characteristics tend to approach
the average of adjacent areas at the boundaries between the areas.

Although linear interpolation in itself is not difficult, sometimes
conditions arise which create serious problems. The most common problem
of this kind occurs whenever more than three centers are so located that
interpolation axes cannot be drawn from each center to all the other
surrounding centers. This condition arises when areas are joined at
their corners with no common boundary. The problem is particularly
difficult when opposite pairs of unconnected centers have similar values
which contrast with other opposite pairs. Even when the direction of
interpolation is not ambiguous the amount is always undefined under
this condition (Figure 4).

FIGURE 4
[Two panels: in one, the isopleths assume the center value is 10; in the
other, the isopleths assume the center value is 20.]
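Simple linear interpolation along a single interpolation axis can be sketched as follows. The Python function below is illustrative only; its name, the coordinates and the values at the two control points are assumptions chosen for the example.

```python
def isopleth_crossing(p, q, value_p, value_q, level):
    """Point on the axis from p to q where the linearly interpolated value equals level.

    Returns None when the level does not fall between the two control-point values.
    """
    if value_p == value_q:
        return None
    t = (level - value_p) / (value_q - value_p)   # fraction of the way from p to q
    if t < 0 or t > 1:
        return None
    return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))


# Control points valued 10 and 30, ten units apart; where does the 15 isopleth cross?
crossing = isopleth_crossing((0.0, 0.0), (10.0, 0.0), 10, 30, 15)   # (2.5, 0.0)
```

The calculation presupposes that an axis between the two centers exists; where areas merely touch at a corner, no such axis is defined and the ambiguity described in the text arises.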

In order to avoid this ambiguity it is necessary that the interpolation
axes form a network of triangles over the map. This triangular pattern
will occur if the corners of adjacent areas are not joined. In other words
all adjacent areas must have a common boundary and not merely touch
at the corners. This arrangement of fields is illustrated by Figures 5
and 9.





7 For a discussion of linear interpolation as applied to isopleth mapping, see Calvin F. Schmid,
Handbook of Graphic Presentation, Ronald Press, New York, 1954, 212-19.

FIGURE 5


This principle of interpolation follows from the fact that the number
of points of equal interpolation value around a polygon is limited by the
number of sides. There is always an even number of equal-value
interpolation points around a polygon and there can be no more such
points than the number of sides of the polygon. The interpolation axes
always form polygons, and when an isoline enters a polygon by crossing a
point having a given value on one of its sides, there should be one and
only one other point, on one other side, by which the line could leave.
Triangles fulfill this requirement. Four or more sides may allow three,
five, or more possible points at which the isoline may pass out of the
polygon (Figure 6). Fortunately, for practical purposes, whenever
adjacent fields have segments of common boundaries the three-sided,
or non-ambiguous, triangular network of interpolation axes results.
Whenever practicable, it is advisable then that a “staggered” grid of
fields be developed over which to locate centers.
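The counting argument can be illustrated by tallying, for a polygon of interpolation axes, how many sides contain a point equal to a given isoline value under linear interpolation. The Python sketch below is an illustration with assumed names and values; it supposes that the isoline value does not coincide exactly with any control-point value.

```python
def sides_crossed(values, level):
    """Number of polygon sides on which the interpolated value equals level.

    values: control-point values listed in order around the polygon.
    """
    n = len(values)
    count = 0
    for i in range(n):
        low, high = sorted((values[i], values[(i + 1) % n]))
        if low < level < high:
            count += 1
    return count


triangle = [10, 30, 20]
square = [10, 30, 10, 30]          # opposite centers alike, as in Figure 4

in_triangle = sides_crossed(triangle, 15)   # 2: one entry point, one exit point
in_square = sides_crossed(square, 20)       # 4: three candidate exits for one entry
```

In the triangle an isoline that enters has exactly one place to leave, whereas in the four-sided case the pairing of crossing points is not determined by the data.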
FIGURE 6

STATISTICAL THEORY AND ISOPLETH MAPPING


In attempting to clarify as well as to formulate solutions to the prob- 
lems outlined in the foregoing paragraphs, it would seem particularly 
appropriate to examine the basic statistical assumptions and implica- 
tions of isopleth mapping techniques. 

Broadly speaking, the statistical map, including the cross-hatched as 
well as the isopleth map, is related to frequency distributions. In this 
general comparison, the cross-hatched area unit map is analogous to the 
histogram and the isopleth map, to the frequency polygon and 
smoothed frequency curve. 

The map, however, is a three-dimensional representation. Since the 
map must be presented on a flat surface—excluding the use of three- 
dimensional models—and the two areal dimensions are given a surface 
orientation, the third or characteristic dimension is portrayed as a 
vertical projection on the surface. The pattern map of areal subdi- 
visions can be considered then, to be the vertical projection of the sur- 
faces of the tops of prismatic volumes over an area. If presented in 
isometric projection rather than vertical projection, the pattern map 
would appear as a block diagram (Figure 7 A). 

The same assumptions and restrictions that apply to the histogram 
also apply to the pattern subdivision map. In the histogram it must be 
assumed that the frequency within a given class interval is uniform 
over the entire interval. This assumption is reflected by the horizon- 
tality of the line across the interval of a histogram at the height above 
the base that corresponds to this assumed uniform value. In the same 
way, it must be assumed that the distribution of a characteristic on a 
pattern map is uniform over the entire area of a subdivision in order to 
justify the use of a uniform hatching or color pattern. 

The frequency represented by a histogram is proportional to the area
under the curve (curve is used here in the mathematical sense, referring
to the vertical and horizontal “steps” of the histogram). If the class
intervals are not equal, the height of the “steps” cannot be proportional
to the frequencies of the several class intervals since in a histogram the
height of a column multiplied by the length of its base is proportional to
its frequency. It follows that the value of a subarea on a pattern map
is proportional to the volume of a prismatic column under the subarea.
Furthermore, unless all the subareas are of equal size the heights of the
columns above the base plane cannot be proportional to the values of
the subareas. In actual practice, of course, the subareas are not usually
equal, and accordingly the height is proportional to the value divided
by the base area.
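The relation between column height, base and frequency can be shown with a small numerical example. In the Python sketch below the class limits and frequencies are invented for illustration; the column height is taken as frequency divided by class width, so that the column area (height times base) carries the frequency.

```python
def histogram_heights(class_limits, frequencies):
    """Column heights such that height times class width equals the frequency."""
    widths = [hi - lo for lo, hi in zip(class_limits, class_limits[1:])]
    return [f / w for f, w in zip(frequencies, widths)]


limits = [0, 50, 100, 200]         # unequal class intervals (hypothetical)
freqs = [30, 30, 30]               # equal frequencies in each class

heights = histogram_heights(limits, freqs)   # last column is half as high as the others
```

Equal frequencies thus give unequal heights when the intervals differ. The map analogue is the same calculation with the value of a subarea divided by its base area.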
[Figure 7. BLOCK DIAGRAM (A) AND STEREOGRAM (B) BASED ON DENSITY OF
ONE-DWELLING-UNIT STRUCTURES ON ONE-FOURTH SQUARE MILE STAGGERED GRID,
EASTERN PART OF SEATTLE: 1950]

Although these assumptions are useful for certain purposes, there is
another approach which approximates more closely most empirical
conditions. This approach is based on the logic and assumptions of the
frequency polygon rather than the histogram. The area under a
frequency polygon which has been constructed by connecting the
midpoints of the tops of histogram rectangles is approximately equal to
the area under the histogram for any class interval with the exception of
a modal or antimodal interval.⁸

This relationship is also applicable to rate maps. It will be observed 
that the volume under the stereogram surface in Figure 7 B, for ex- 
ample, made up of planes connecting centroidal points on the top sur- 
faces of the prismatic block diagram, is approximately equal to the 
volume of the block diagram, and with the same exceptions of modal or 
antimodal intervals. It may be noted here that for each antimode there 
must be two modes (assuming a recession in values at the extremes) 
and that the biases tend to be compensating up to the limit of a single 
mode or antimode. A multi-modal curve or surface is no more biased 
than a unimodal distribution. It follows that little distortion of the
representation of values—that is, statistical measurability—accrues if 
a block diagram is transformed into a stereogram by connecting the 
centroids of adjacent subareas with straight lines forming sloping 
planes. In addition, under the assumption that empirical agreement 
with data is improved by interpolation between centers, the surface of 
the stereogram is a better representation of the data than is the block 
diagram. 

Although the stereogram devised on the basis of rates has no “real” 
or observable counterpart—the “elevations” being hypothetical con- 
structs designed to depict the varying rates and not the result of meas- 
urements at particular points—the hypothetical elevations on the sur- 
face of the stereogram may be represented on a vertical projection by a 
series of curves representing equal “elevations,” or actually, equal rates. 
The stereogram surface, if it is not smoothed, is represented by a con- 
figuration of intersecting planes (Figure 7 B). Equal elevation is repre- 
sented by a plane parallel to the base of the stereogram, and such a 
plane will cut the stereogram surface in a series of straight line seg- 
ments which form polygons in the vertical projection (Figure 8 A). 





8 The mode is a maximum. That is, any point, any area of uniform height, any histogram interval,
or any other graphically described position which is surrounded on all sides by positions of lesser value
is called a mode. An antimode is a minimum in the same sense. An antimode is surrounded on all sides
by greater values. Distributions may have any number of modes and antimodes. In two dimensional
presentation, if the data recede at both extremes there is always one more mode than antimodes. If the
data increase at the extremes there is one more antimode than modes, and if the data increase at one
extreme and recede at the other there are the same number of modes as antimodes. In three dimensional
presentation the same relationship between modes and antimodes usually obtains.

[Figure 8. STEREOGRAM IN TOP VIEW AND ISOPLETH MAP BASED ON DENSITY OF
ONE-DWELLING-UNIT STRUCTURES ON ONE-FOURTH SQUARE MILE STAGGERED GRID,
EASTERN PART OF SEATTLE: 1950. (A) Stereogram, top view: an orthographic
projection of the stereogram of Figure 7B, with dash-dot lines
representing values of density per square mile of one-dwelling-unit
structures. (B) Isopleth map: the isopleths are smoothed curves following
the equal density lines of the stereogram shown in Figure 7B and in (A).]

These polygons can be smoothed by precise techniques, but errors
involved in the various assumptions leading to their construction are
sufficiently great to make precise smoothing comparable to
computations beyond the limit of significant digits in elementary
mathematics⁹ (Figure 8 B). In order to obviate a false appearance of
accuracy and at the same time reduce chance fluctuations in the data,
freehand smoothing can be used. As a test of the reliability, four
draftsmen independently smoothed freehand a given polygon and the
results were compared. The four curves drawn by the draftsmen were
superimposed to show general agreement or disagreement and the areas
measured with a planimeter. In the light of this experiment, it was found
that the variations were relatively small, involving a maximum difference
in area of less than one per cent.

The vertical dimension on an isopleth map is assumed to vary 
linearly from isoline to isoline and on the basis of this assumption, ap- 
proximate values for any desired area of the map can be computed. By 
using a planimeter the desired areas can be measured, and average 
heights determined. The value for a specified area is the area multiplied 
by its average height in terms of the isolines that cover it. 
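
The computation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the sample area, and the isoline values are hypothetical, and the average height of a band between two adjacent isolines is taken as their midpoint, which follows from the linear assumption only when the band is of roughly uniform width.

```python
def value_for_band(area_sq_mi, lower_isoline, upper_isoline):
    """Approximate total for a region bounded by two adjacent isolines.

    Assumption: with linear variation between isolines, the average
    height of the band is taken as the midpoint of the two isoline values.
    """
    average_height = (lower_isoline + upper_isoline) / 2.0
    return area_sq_mi * average_height


# Hypothetical planimeter reading: 0.25 square mile lying between the
# 1,000 and 2,000 units-per-square-mile isolines.
band_total = value_for_band(0.25, 1000, 2000)
print(band_total)
```

A region covered by several bands would be handled by measuring each band separately and summing the results.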

Frequently, the characteristic portrayed on an isopleth map is ex- 
pressed as a ratio of two non-areal units, such as dollars per person 
(income), persons per household, or some other combination of social 
or economic factors. When isoline maps are used to delimit these higher- 
ordered characteristics, the immediate relationship to area is obscured. 
The fact that a map is used implies some inherent association between 
the characteristics and their distribution over an area. The association 
can be expressed mathematically and hence manipulated statistically
if each characteristic is related separately to the area. Since reciprocal 
values for any characteristic do not alter the isopleth placement, any 
line that represents a value per unit area can be said to represent also a 
unit area per unit value. That is, a line which represents 100 persons per 
square mile can also be said to represent one one-hundredth of a square
mile per person. Multiplication of isopleth line values is accomplished
by superimposing the corresponding isolines on the same map. If one 
of the sets of isolines is then expressed as a reciprocal relationship with 
area, the superimposed lines will cancel out the areal unit, leaving the 
ratio as an expression between characteristics. To illustrate, a map may 
be prepared with isolines showing one-dwelling-unit structures per





* The contour polygons need not be drawn in actual practice. The vertices can be connected with a
smooth curve without first completing the edges of the polygons, which are not used in the final drawing 
anyway. 
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square mile. A corresponding map might show number of dollars valua- 
tion of one-dwelling-unit structures per square mile. Both of these 
ratios are compatible with the usual concept of areal based, three- 
dimensional isopleth maps. If one of them, let us say, dwelling units per 
square mile, is designated as a reciprocal—square miles per dwelling 
unit—the intersection of an isopleth describing one-hundred thousand 
dollars per square mile with an isopleth indicating .085 square miles 
per dwelling unit represents 100,000 times .085 or $8,500 per one-dwell- 
ing-unit structure at that point. It should be pointed out that different 
sized areal bases for calculating the location of isolines will give differ- 
ent patterns for the same data. In order that the approximation be as 
statistically measurable as possible, it is important that the areas used 
to compute the isolines for both characteristics be the same size and 
shape. The isopleth map showing mean value of one-dwelling-unit 
structures in the Eastern Part of Seattle in 1950 shown in Figure 9 was 
constructed in this way. 
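
The cancellation of the areal unit at an intersection of two isolines can be checked with a short Python sketch. The figures are those quoted above; the function and variable names are illustrative assumptions and do not come from the article.

```python
def reciprocal(value_per_sq_mi):
    """Re-express a value per square mile as square miles per unit value."""
    return 1.0 / value_per_sq_mi


def ratio_at_intersection(numerator_per_sq_mi, sq_mi_per_denominator_unit):
    """Multiply the two isoline values so that the areal unit cancels."""
    return numerator_per_sq_mi * sq_mi_per_denominator_unit


dollars_per_sq_mi = 100_000   # valuation isopleth, from the text
sq_mi_per_dwelling = 0.085    # reciprocal density isopleth, from the text

value_per_structure = ratio_at_intersection(dollars_per_sq_mi, sq_mi_per_dwelling)
dwellings_per_sq_mi = reciprocal(sq_mi_per_dwelling)

# About $8,500 per structure, as stated in the text.
print(round(value_per_structure), round(dwellings_per_sq_mi, 1))
```

The reciprocal line occupies the same position on the map as the original density line; only its label changes.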


TECHNIQUE FOR CONSTRUCTING A STATISTICALLY 
VALID ISOPLETH MAP 


The value for any center or control point has significance only in 
relation to the size and shape of the area which it represents. If points 
along an interpolation axis are to be determined between centers repre- 
senting different sized and shaped base areas, or if a center is de- 
termined by the intersection of two isolines which represent different 
sized or shaped base areas, the resulting points cannot be interpreted, 
since the base with which they must be associated cannot be desig- 
nated. When unequal or non-congruent base areas are used in con- 
structing isopleths, the resulting map cannot be interpreted precisely 
nor considered statistically measurable. Therefore, a map of this kind 
can be interpreted only if it is assumed to be similar to a map con- 
structed over congruent base areas. 

In order to construct an “ideal” isopleth map, the characteristic or 
characteristics under consideration must be exactly located on a map 
of suitable size and projection. One method by which a complete dis- 
tribution of the characteristics can be exactly located is by means of a 
dot map. Also, in order to provide an infinite number of centers or con- 
trol points, it must be possible to apply without restriction as to loca- 
tion, a constant base area to the map. The control points thus derived 
represent centroids and values with reference to the distribution of the 
characteristics within the base area for any given points. If a very 
large number of such centers is located, equal values will appear as a 


+ Sac ee Riemer eT MIEN EET TEP AMER bg RRR RII cor — 


OI NRTA NRC Nee mn 











ISOPLETH MAPPING 235 





ES SD 








lua- 


ese ISOPLETH MAP BASED ON 


ree- ONE-FOURTH SQUARE MILE STAGGERED GRID 


er 
ling MEAN VALUE OF ONE-DWELLING-UNIT STRUCTURES 


and 9 EASTERN PART OF SEATTLE: 1950 
Liles 19 
vell- 
rent 
ffer- 
e as 
ised 
and 
unit 
was 






















Na Ry a is ETA RIOD 





PS rs 
VOLUNTEER + 
roth x oS 
© |? parw = 
shy ’ 








pone ee 


y in 
ints 
pre- 


“| 














rent 
ted, fF 
asig- 
con- 
isely 
kind 
con- 


WASHINGTON 








Levee 

















ISOLINES INDICATE 
AVERAGE VALUES 
PER STRUCTURE IN 
THOUSANOS OF DOLLARS. 


ic or 
map 











of a 
con- 

















‘is 2. 2 «2 
|oca- SCALE IN MILES 


ived 
f the 
very 
as a 














FIGURE 9 











236 AMERICAN STATISTICAL ASSOCIATION JOURNAL, MARCH 1955 


continuous series of points, or a curve, on the map. Such curves are 
the “true” isopleths of the characteristic on the projection used with 
respect to the base area from which they are computed. This “ideal” 
isopleth map can be approached by a technique which shall be referred 
to as the “floating grid technique,” which is applied to a dot map. This 
technique will yield an isopleth map which is as empirically accurate 
and statistically measurable as is the dot map over which it is con- 
structed. Since the dots are located to represent the total distribution, 
no assumption as to the uniformity of the distribution is necessary, 
Furthermore, since intermediate points can be determined readily by 
slight shifts of the “floating grid,” no ambiguity of points will result, 
nor are any assumptions needed with respect to interpolation. In prac- 
tice, of course, an infinite number of centers is not located, and inter- 
polation is used; but a sufficiently large number of centers is located so 
that extreme generality and excessive subjective determination is 
eliminated, and questionable interpolations are analyzed by precise 
methods. 

To apply the “floating-grid technique,” a suitable base map is pre- 
pared on which the characteristic under consideration is located as ac- 
curately as possible with small, uniform dots. In the case of higher- 
order isopleth maps in which more than one characteristic is considered 
—such as persons per house or divorces per unit population—each of 
the characteristics must be located separately on the map, or separate 
base maps must be prepared for each characteristic. A suitable base 
area is chosen with respect to the data under consideration. For some 
data, areas as small as a single city block might be suitable, whereas for 
other data such small areas might result in superfluous and irrelevant 
detail. For example, with respect to population density, a vacant lot of 
the size of a city block or less, for most purposes might be disregarded 
or considered to be a chance fluctuation. However, several such vacant 
lots within a small district, or large tracts of vacant property in any 
area, are usually quite significant in this respect. If this is the case, base 
areas for population density isopleths should not be as small as a city 
block, but neither should they be so large that adjacent, heavily popu- 
lated areas will obscure a series of vacant blocks or large vacant tracts. 
A base area ranging from four blocks square to a quarter of a square 
mile might be acceptable for an urban population density isopleth 
map. For states or larger regions, however, base areas of ten to one- 
hundred square miles might be appropriate. 

From the point of view of interpretation of the isopleth map, a regu- 
lar geometric form is desirable for the base area. The orientation of the 
base should not affect the determination of the center or control point. 
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In this respect any regular geometric shape—circle, triangle, pentagon, 
hexagon, etc.—is equally acceptable, but due to special and simple 
means of determining the centroid of mass within a square, the square 
base is recommended for practical reasons. A convenient, practical 
floating grid is made by drawing on some transparent material, such as 
transparent acetate, a square of the size determined as discussed above, 
and scaled to the proportions of the dot map over which it is to be 
applied. The square is then divided into quarters with lines parallel to 
its sides and each quarter marked with a fine grid of from five to ten 
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squares in each direction (Figure 10 A). The grid may be used in the 
same way that a rectilinear coordinate system is used to determine 
centroids. Each dot which falls within the grid is given an X and Y 
value according to its location on the coordinate system, and the mean, 
or average, X, Y point is the centroid of that particular base area and 
has a value equal to the number of dots included within the grid. 

For example, in Figure 11, four dots appear within the grid. The dot 
marked “A” is three units up, or positive from the center and four units 
to the right, or positive from the center. “B” is two units to the left, or 
negative from the center and three units up, or positive from the 
center, etc. The sum of the left and right, or X direction distances is 
zero, and the sum of the up and down, or Y direction distances is plus 4. 
Since there are four dots, the mean or average distance left and right 
is 0/4 or zero, and the mean distance up and down is 4/4 or plus one. 
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The centroid, then, is located zero units left and right and one positive 
or up unit from the center and is given a value of four. Shifting the grid 
to a position which would include only three of these points, if no new 
points were “picked up” in the area by the shift, would give a new 
centroid position with a value of three. In this manner as many 
centroids or centers can be located and evaluated as are necessary to 
generate the desired number of isopleth lines over the map. 
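
The centroid computation lends itself to a brief Python sketch. The coordinates of dots A and B are taken from the example above; the positions of the other two dots are hypothetical, chosen only so that the sums agree with those stated (zero in the X direction, plus four in the Y direction). The function name, the half-width of five units, and the treatment of dots lying exactly on the edge of the grid are likewise assumptions.

```python
def control_point(dots, center_x, center_y, half_width):
    """Return ((mean_x, mean_y), count) for the dots inside a square grid.

    The square is centered at (center_x, center_y); dots on its edge are
    counted as inside (an assumption). Returns None if the grid is empty.
    """
    inside = [
        (x, y)
        for x, y in dots
        if abs(x - center_x) <= half_width and abs(y - center_y) <= half_width
    ]
    if not inside:
        return None
    count = len(inside)
    mean_x = sum(x for x, _ in inside) / count
    mean_y = sum(y for _, y in inside) / count
    return (mean_x, mean_y), count


# A and B as described in the text; C and D are hypothetical positions.
dots = [(4, 3), (-2, 3), (1, -1), (-3, -1)]

# Grid centered on the origin: all four dots fall inside.
centroid, value = control_point(dots, 0, 0, 5)
print(centroid, value)

# Grid shifted two units to the left: dot A drops out, leaving three dots.
shifted = control_point(dots, -2, 0, 5)
print(shifted)
```

Repeating the call for a succession of small shifts gives as many control points as are wanted, each carrying the count of dots as its value.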

No centers can be located on a map within a band approximately 
one half the width of the base grid at the limits of the map area. This 
band will extend along all natural boundaries as well as along vacant 
areas and areas adjacent to the last recorded data on the dot map. In 
the case of arbitrarily defined limits, this difficulty can be solved readily 
by including enough additional territory on the dot map so that it will 
be possible to complete the isopleths to the desired limits. In order to 
extend the isolines up to lake shores or to other natural boundaries, it is 
necessary to proceed according to some arbitrary assumption concern- 
ing the data being depicted. First, if it is assumed that the density 
drops to zero rather quickly at the natural boundary, no isoline can be 
allowed to terminate at that boundary. For example, all lines approach- 
ing a lake shore would tend to run parallel to and near the shore line. 
A second, and perhaps more acceptable assumption would be to extend 
the values to the natural boundary and then terminate them abruptly. 
This assumption requires the extrapolation of values across the band in 
which centers cannot be located. Vacant property can be handled in 
exactly the same way as lakes, bays and rivers. This procedure was used 
in the construction of the isoline maps in this article. 

No attempt should be made to locate an isoline which has a value 
based on less than three dots. For example, if each dot represents 100 
persons, no isoline should be drawn to represent less than 300 persons 
per unit area where the unit area is the floating-grid size. If the grid is 
one-quarter square mile and the dots represent 100 persons, the minimum
isoline value should be 1,200 persons per square mile. This principle
has been violated in the case of the isolines designated / in Figure
10 B, and therefore the isopleth indicating 1,000 persons per square
mile on this map is not as reliable as the other isopleths. When a large
number of centers and their values have been located, any intermediate
points and values can be determined by linear interpolation with relatively
small loss in accuracy, especially if all unusual patterns of dots
on the base map have been covered carefully by measured values. Con- 
necting equal-valued centers by smooth, continuous curves can be ac- 
complished readily by a draftsman without resort to mathematical or 
mechanical smoothing techniques and with high reliability. 
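
Both rules in the preceding paragraph, the three-dot minimum and linear interpolation between located centers, can be sketched in Python. The function names and the two sample centers are hypothetical; the 100 persons per dot and the quarter-square-mile grid are the figures used in the text.

```python
def minimum_isoline_value(units_per_dot, grid_area_sq_mi, min_dots=3):
    """Smallest isoline value, per square mile, supported by min_dots dots."""
    return min_dots * units_per_dot / grid_area_sq_mi


def interpolate_position(p1, v1, p2, v2, target):
    """Locate the point on the axis p1-p2 where the value equals target.

    Assumes the value varies linearly between the two centers and that
    v1 and v2 differ.
    """
    t = (target - v1) / (v2 - v1)
    return (p1[0] + t * (p2[0] - p1[0]), p1[1] + t * (p2[1] - p1[1]))


floor_value = minimum_isoline_value(100, 0.25)
print(floor_value)  # 1,200 persons per square mile, as in the text

# Hypothetical centers valued 10 and 20; where does the 15 isoline cross?
crossing = interpolate_position((0.0, 0.0), 10, (4.0, 0.0), 20, 15)
print(crossing)
```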











THE 1955 ECONOMIC REPORT OF THE PRESIDENT* 
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I. THE CONTENT OF THE REPORT 


ALL who are concerned with the role of economics in formulating
public policy, as well as the manifestations of such policy in the
ebb and flow of economic conditions, should welcome the 1955 Eco- 
nomic Report of the President as a clear statement of principles guiding 
the present Administration in dealing with economic activity. Although 
other factors have undoubtedly affected some policy decisions, it is 
clear that economic analysis has played an important role. The Report 
is oriented toward an analysis of recent business developments and the 
outlook for 1955 with policy prescriptions for facilitating the growth of 
private enterprise and increasing the stability of our economy. It is 
well written and, consequently, can be understood by the lay reader 
as well as by the technical economist. 

The Report consists of three chapters and four appendices. Chapter 1, 
“The Expansive Power of the American Economy,” includes a discus- 
sion of Federal Government obligations under the Employment Act, 
factors contributing to the achievement of growth potentials, and 
Government action taken during 1954 to build a stronger economy. 
Chapter 2, “A Year of Economic Transition,” includes a detailed anal- 
ysis of the nature of the business contraction in 1954, reasons for the 
mildness of the contraction, Federal Government policies and their 
effects, and a listing of lessons from experience and guides to the future. 

Chapter 3, “Program for Sustained Economic Progress,” is devoted 
to a discussion of Government policies designed to promote long-term 
growth. The emphasis on growth rather than stabilization is due to the 
belief that the economy is now “undergoing a cumulative expansion of 
some strength” which is expected to lead to a “high and satisfactory 
level of employment and production within the current year” (1955, 
p. 48). This chapter deals with policies designed to promote the spirit 
of enterprise, encourage foreign trade and investment, improve social 
security measures, improve public assets such as public works and 
natural resources, and increase the stability of a growing economy. 





* A review article on The Economic Report of the President, U.S. Government Printing Office, Jan- 
uary 1955. Pp. x, 203. $0.75. 
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The appendices include a summary of economic recommendations of 
the President, a factual documentation of the analytical account of 
economic developments in 1954 given in Chapter 2, a report on the 
activities of the Council during 1954, and a set of statistical tables re- 
lating to income, employment and production. 


II. EVALUATION 


A. Comparative Analysis 

In order to appreciate fully the significant changes that have oc- 
curred in the President’s Economic Report in recent years, it is useful 
to compare the current report with the last one written by the previous 
Administration in January, 1953. It should be recognized that the current
Administration has benefited from the experience of the earlier
work so that it would be disappointing if some improvement had not 
been made. Although opinion on the extent of the betterments may 
differ, depending partially on the economic philosophy of the reader, it 
is the opinion of the reviewer that important improvements have been 
accomplished. 

Perhaps the most significant change is in the economic philosophy 
underlying the two Reports. Although the current Report does not 
recommend a return to laissez-faire, it places considerably more re- 
liance on private initiative for achieving the goal of maximizing the 
satisfaction of human wants, and less emphasis on Government intervention.
The following basic economic tenets found in the current Report
outline the economic philosophy of this Administration. “First,
competitive markets, rather than Governmental directives, are as a 
rule the most efficient instruments for organizing production and con- 
sumption. Second, a free economy has great capacity to generate jobs 
and incomes if a feeling of confidence in the economic future is widely 
shared by investors, workers, businessmen, farmers, and consumers. 
Third, the Federal Government creates an atmosphere favorable to 
economic activity when it encourages private initiative, curbs monopo- 
listic tendencies, whether of business or labor, avoids encroachment on 
the private sector of the economy, and carries out as much of its own 
work as is practicable through private enterprise. Fourth, the Federal 
Government generates confidence when it restrains tendencies toward 
recession or inflation, and does this by relying largely on indirect means 
of influencing private behavior rather than by direct controls over 
people, industries, and markets. Fifth, the Federal Government con- 
tributes to economic growth when it takes its part, at the side of the 
States, in promoting scientific research and in providing public facili- 
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ties, such as highways, hospitals, harbors, and educational institutions, 
on which the expansion of the private economy heavily rests. Sixth, the 
Federal Government strengthens the foundations of the economy when 
it widens opportunity for its less fortunate citizens and, working in co- 
operation with the States and localities, helps individuals to cope with 
the hazards of unemployment, illness, old age, and blighted neighbor- 
hoods” (1955, p. 2). Although the preceding Administration would 
probably concur in the above principles, the economic policies pursued 
by the two groups indicate a greater willingness on the part of the 
present Administration to implement this philosophy with concrete 
action. As indicated by points four and five, this Report places as much 
if not more emphasis on non-defense spending on public facilities and 
social security measures as did the previous one. 

Both Reports placed considerable emphasis upon the responsibility 
placed on the Government by the Full Employment Act for stabilizing 
the economy. Perhaps the greatest difference in emphasis is the greater 
reliance placed by the present Administration on a flexible monetary 
policy, both for restraining an inflation as well as for promoting eco- 
nomic recovery. In the January, 1953 Report the following statement 
was made: “The first objective of credit policy is to assist production; 
stabilization is an associated purpose... . The prime role of credit in 
production is, of course, not to create demand, but to implement it; its 
job is to facilitate rather than generate. It is possible for credit policy 
to achieve a victory over inflation by withholding funds from those de- 
siring to buy, although such action may thwart the first objective of 
the credit mechanism. At the other extreme, in a deflationary situation, 
it is hardly to be expected that credit policy alone can arouse business 
from recession by making funds available to finance demands which do 
not exist” (1953, pp. 118, 119). 

In contrast, the current Report states that the special character of the 
recent actions of the Federal Government to stimulate the economy 
“was their promptness and the heavy reliance on monetary policies and 
tax reductions. ... Had it not been for the increased availability of 
credit and the easing of terms, the fast pace of residential, commercial 
and state and local construction, which did so much to stabilize the 
economy during the past year, would not have been attained” (1955, 
pp. 20, 22). The present Administration apparently believes that a 
flexible monetary policy can generate as well as facilitate demand. Ex- 
perience with the “active easy” monetary policy of the Federal Re- 
serve Board during the last year appears to substantiate that point of 
view. 
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Both Reports emphasized the importance of fiscal policy as a stabiliz- 
ing device, including both tax and expenditure adjustments. The cur- 
rent emphasis on increased spending on public works is due to a desire 
to promote economic growth rather than to use these expenditures for 
stabilization purposes. It is contended in the present Report, however, 
that the first line of defense against recession is a strong growth 
economy. Furthermore, the present Report argues that although Gov- 
ernment can do much to moderate economic fluctuations, there is no 
reason to believe that they can be completely eliminated. One gets the 
impression from reading the 1953 Report that the previous Administra- 
tion felt that all downward fluctuations in business activity should and 
could be eliminated. Conversely, the present Administration appears 
more concerned about preventing inflation than was the former group, 
and strongly recommends use of both monetary and fiscal policies to 
accomplish this objective. 

The 1955 Report continues the useful earlier practice of using well 
designed charts for illustrating the written material. It is unfortunate 
that tabular data on the interesting chart on output per man-hour in 
major industries is omitted (1955, p. 5). One extremely useful innova- 
tion in the current Report is a tabular reconciliation of Federal Gov- 
ernment receipts and expenditures from the national income accounts, 
the consolidated cash budget and the conventional budget for fiscal 
years 1952-54. As in earlier Reports, the current edition presents esti- 
mates of the cash and conventional budget for the current and next 
fiscal years. Since most business forecasts are not made on a Federal 
fiscal year basis, it would be desirable to have these estimates available 
on a quarterly or calendar basis. Also, the inclusion of estimates of 
future Federal receipts and expenditures on a national income account 
basis would be useful for forecasting purposes. 

One of the most significant changes in the use of economic analysis 
by the two Administrations under consideration is reflected in the 
changed status of the Council of Economic Advisers. The information 
provided in the 1954 Report on the “Reorganization of the Council” 
and in the 1955 Report on “Activities of the Council” indicates a sig- 
nificant improvement in the effective use of staff advice on economic 
matters. 

In commenting on his tenure as Chairman of the Council of Eco- 
nomic Advisers, Edwin G. Nourse made the following statement: 
“Causes for the decline of the Council of Economic Advisers are not far 
to seek. The President [Truman] did not in his initial appointments 
succeed in finding three economists of the stature needed for the task; 
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he did not accord them the status in the Executive Office requisite for 
success; he did not make effective use of them or influence his official 
family to do so; he did not establish a confidential character for their 
advisory service or recognize a distinction between economic service 
and political involvement... . The general verdict seems to be ‘Too
much politics and not enough economics. So interpreted and so
operated, this device of the Employment Act is at best superfluous and
at worst mischievous.’”¹

In the 1955 Report, light is shed on how the Council is now being 
used. “In its relation to the President, the Council functions in the 
economic realm in many respects as the Joint Chiefs of the Staff function
in military matters. . . . The Council gives its undivided attention
to analyzing how the entire economy is faring, to exploring ways and 
means of adding to its strength, and to advising the President on ap- 
propriate economic policies” (1955, p. 129). 

A further indication of how the Council is being used is found in the 
following statement: “A representative of the Council, generally the 
Chairman, reported personally to the President on economic matters 
once a week, sometimes more often. A representative of the Council 
also appeared regularly at Cabinet meetings to present the Council’s 
thinking about the state of the economy and ways of dealing with the 
changing economic situation” (1955, p. 129). Mr. Nourse indicated that 
“On only half a dozen occasions while I was on the Council were we 
invited to sit in on a regular or special meeting of Cabinet officers . . . ” 
(p. 384) and then for purposes other than rendering economic advice. 
In addition, at the present time the Chairman of the Council of Eco- 
nomic Advisers is Chairman of an Advisory Board on Economic 
Growth and Stability made up of representatives of several depart- 
ments and agencies. This group meets weekly and “assures close liaison 
between the Council and Government agencies that have administra- 
tive responsibility for various economic programs. It also provides the 
Council with timely information and advice on a wide range of current 
economic issues” (1955, p. 131). 

In the absence of more detailed information as to how the Council is 
now used, it is impossible to compare adequately the functioning of the 
present Council with the earlier group. However, on the basis of the 
limited information available, it seems extremely likely that the current 
Council is providing useful economic advice and analysis and that for 
the first time national policies are being formulated after a careful con- 





1 Economics in the Public Service, New York: Harcourt Brace & Co. (1953), p. 454. 
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sideration of the economic consequences. If this conclusion is correct, 
it represents an important step forward in implementing the objectives 
of the Employment Act of 1946. 

The coverage of the statistical tables relating to income, employ- 
ment, and production is again expanded in the current report. The data 
are well presented and represent a convenient reference source for re- 
searchers in these areas. Detailed breakdowns are made when possible 
and the annual data usually cover a considerable span of years with 
semiannual, quarterly, or monthly data for more recent years when ap- 
propriate. 


B. Some Possible Shortcomings 


It is, of course, impossible to analyze adequately in this review each 
Government policy discussed in the President’s Report even if the re- 
viewer were competent to do so. Therefore, emphasis on possible short- 
comings will be placed only on those aspects of the report that appear 
particularly vulnerable. 

The 1955 Report properly took credit for doing an adequate analysis 
of the economic situation in early 1954. “The earlier Report set forth 
the conditions of economic progress in our country and in our times. 
By and large, the events of the intervening year have borne out the 
conclusions of that Report concerning the economic state of the Nation 
and the policies needed to promote sound economic growth” (1955, 
p. 1). The major conclusion on the current outlook of the earlier report 
was that outlays in most areas would be well maintained in the visible 
future and the adjustments then in process were not likely to initiate a 
“cumulative downward movement” of the economy (1954, p. 71). 

Although most of the 1954 estimates were of a general rather than 
specific nature, the explicit estimate of changes in Federal spending, as 
measured in the gross national product accounts, proved to be ex- 
tremely poor. The 1954 Report estimated that by mid-1954 reductions 
in Federal spending for goods and services might be about $2 billion 
below the rate at the end of 1953 (January, 1954, p. 67). It was stated 
that “Federal expenditures will continue to be a strong sustaining fac- 
tor” (1954, p. 67). The authors of the 1954 Report expected the “small 
prospective decline in Federal expenditures” to be counteracted in large 
part by a rise in state and local purchases. Actually Federal spending 
for goods and services in the second quarter, 1954, as measured by the 
GNP accounts, was down $8.5 billion from the rate in the last quarter 
of 1953 to a rate of $51.3 billion. Thus, the decline was apparently 
$6.5 billion more in that six-month period than expected. The forecast 
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six-month change was only 23.6% of the actual change or the actual 
level of Federal spending was 11.3% less than forecasted. For the entire 
year of 1954, Federal spending was $9.8 billion below the fourth 
quarter 1953 rate, and the rate of State and local spending rose only 
$1.3 billion. 
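
The arithmetic behind these comparisons can be retraced from the rounded figures quoted above. The sketch below is only a check under that assumption; the variable names are illustrative, and because the inputs are rounded the ratios come out close to, though not exactly at, the percentages given in the text.

```python
# Figures in billions of dollars, annual rates, as quoted in the text.
actual_rate_q2_1954 = 51.3
actual_decline = 8.5      # drop from the fourth-quarter 1953 rate
forecast_decline = 2.0    # "about $2 billion" in the 1954 Report

rate_q4_1953 = actual_rate_q2_1954 + actual_decline
forecast_rate_q2_1954 = rate_q4_1953 - forecast_decline

# Decline in excess of the forecast.
excess_decline = actual_decline - forecast_decline

# Forecast change as a share of the actual change.
share_of_change = forecast_decline / actual_decline

# Shortfall of the actual level relative to the forecast level.
level_shortfall = (forecast_rate_q2_1954 - actual_rate_q2_1954) / forecast_rate_q2_1954

print(round(excess_decline, 1))
print(round(100 * share_of_change, 1), round(100 * level_shortfall, 1))
```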

Private forecasters also generally underestimated the cut in Federal 
spending in 1954. Yet, in view of the fact the Federal Government has 
considerable control over this segment of spending, this poor forecast 
in the 1954 Report raises at least two important questions. (1) Was this 
a case of poor liaison among the Treasury, the Council, and the Presi- 
dent, or is the gross error in underestimating reductions in Federal 
spending to be charged to the Treasury? (2) If the Council (or Treas- 
ury) had properly estimated the drop in Federal spending, would the 
Administration have sponsored larger tax cuts in 1954 to assist the 
transfer of resources from the Federal to the private sector of the 
economy? Neither of these questions can be answered conclusively on 
the basis of present information, but they deserve answers by the Ad- 
ministration. 

On the basis of statements made by Treasury and Council officials 
during the past two years, it seems probable that the Council was much 
more concerned about the destabilizing effect of reducing Federal 
spending relative to taxes during a period of business decline than was 
the Treasury. Treasury officials appeared to be interested primarily in 
balancing the budget and extending the average maturity of the 
Federal debt. Although these are laudable longer run objectives, they 
were largely inconsistent with the overall objective of stabilizing the 
economy in 1954. Fortunately, from a stabilization viewpoint, the 
Treasury restricted its debt lengthening activities to 6- to 9-year ma- 
turities which were bought primarily by commercial banks. 

In listing six “lessons from experience and guides to the future,” the 
1955 Report states “that contraction may be stopped in its tracks even 
when Governmental expenditures and budget deficits are declining, 
provided effective means are taken for building confidence” (1955, p. 
22). It cannot be denied that this was accomplished in 1954, but at the 
expense of reduced economic growth and production, and substantially 
increased unemployment. It is suggested that a seventh lesson learned 
(or relearned) from the 1954 experience, is that Federal spending 
should not be reduced more than taxes during a period of business de- 
cline. It would seem that the failure to observe this principle in formu- 
lating tax policy in 1954 was largely responsible for the decline in 1954, 
since even the inventory decline was probably substantially due to the 
sharp decrease in Federal spending without a compensating rise in the 
private sector. Although the Administration may be blameless for 
underestimating the extent of the cut in Federal spending in 1954 due 
to possible forecasting difficulties, it would seem that the 1955 Report 
is something less than completely frank in failing to emphasize the 
destabilizing effects of the Federal budget during the past year. 

The Administration’s case for a higher minimum wage of $0.90 is not 
convincing and seems somewhat inconsistent with the emphasis on free 
markets. The 1955 Report argues that since the present minimum of 
$0.75 was established “the cost of living and average hourly earnings 
have risen, providing reason for an increase in the minimum wage 
when, as at present, the economic outlook is favorable” (1955, p. 58). 
The Report points out that minimum wage laws do not get at the 
fundamental causes of low incomes or poverty and that a higher minimum 
wage would add appreciably to the costs of certain industries, notably 
in the South. Furthermore, it states: “Nevertheless, $0.90 an hour is 
the highest minimum wage that can be economically justified in present 
circumstances. A higher minimum might well cause lower production 
and substantial unemployment in several industries, and—whether di- 
rectly or indirectly—it would probably bring generally higher prices in 
its wake” (1955, p. 58). 

Chart B-3 on page 90 of the 1955 Report shows that unemployment 
rates tended to be high in the South as of June 30, 1954. Unless very 
low wage rates in the South are normally associated with monopson- 
istic conditions, it would seem reasonable to expect that if a $0.90 
minimum wage raises wage rates in the South, unemployment rates in 
this region will be raised. To the extent that improving business condi- 
tions offset this effect, wages would probably rise in any event. Except 
in a few small isolated communities, there is little reason to believe that 
the Southern employer is immune to competitive forces. The effect 
of a minimum wage in excess of $0.90 would differ from the effects of a 
minimum wage of $0.90 only in degree, rather than kind. In summary, 
the Administration’s defense of this proposal makes it appear to regard 
the $0.90 minimum wage recommendation as yielding the maximum 
political benefit for the moderate amount of economic dislocation al- 
lowable. 

The Report’s recommendation that the lending authority of the Fed- 
eral Government for granting loans to small business “should be en- 
larged so that loans may continue to be made to small concerns that 
cannot obtain adequate financing on reasonable terms” (1955, p. 50) 
also appears to be inconsistent with the overall emphasis on free mar- 
kets. There is, of course, always a shortage of “adequate financing on 
reasonable terms” when the rate considered to be “reasonable” is sub- 
stantially below the rate justified by the risks involved. The very sharp 
increase in the number of new business firms in the postwar period does 
not indicate substantially increased financial barriers to entry into 
business. There is little reason to believe that Government officials are 
better qualified for determining credit worthiness than the local banker 
or other sources of finance which have detailed information relating to 
the nature of the prospective borrower. The granting of funds by 
Government at low rates to finance businesses that are otherwise denied 
such funds has the effect of giving control over resources to relatively 
inefficient producers. In addition, this program may have a destabiliz- 
ing effect on business activity as the demand for loans at relatively low 
rates of interest tends to rise during periods of high business activity 
and fall during periods of declining business. It is difficult to see how 
this recommendation could have been based primarily on economic 
analysis. 

It has been suggested that the present Administration appears to be 
much more concerned about the evil effects of inflation than about the 
undesirable effects of reduced business activity and employment. Cer- 
tainly it is fair to say that the present Administration is more con- 
cerned about avoiding inflation than were its predecessors. Yet, it ap- 
pears to the reviewer that the present Report is correct in emphasizing 
the desirability of avoiding either extreme and that the policy sugges- 
tions have been largely consistent with this goal. 

On the whole, the current Report is well done and is a credit to the 
Council of Economic Advisers, which assisted in its preparation. It 
should prove of interest and use to all who are interested in the applica- 
tion of economic analysis to the formulation of public policy. 
Armitage, P., “A note on the time-homoge- 
neous birth process,” Journal of the Royal 
Statistical Society, 15 (1953), 90-91. 

Some limiting properties of the time- 
homogeneous birth process, valid for small 
exposure times, are obtained. Included are 
an expression for $p_z(T)$, the probability that 
exactly z events take place in the time in- 
terval (0, T), and a demonstration that the 
mean exposure time remaining in the time 
interval (0, T) after the zth event has oc- 
curred is asymptotically T/(z+1). Richard 
G. Cornell, Virginia Polytechnic Institute. 


Bailey, N. T. J., “On queueing processes 
with bulk service,” Journal of the Royal 
Statistical Society, Series (B), 16 (1954), 80- 
87. 

This paper investigates mathematically 
the queueing problem in which a single 
queue forms from elements arriving at ran- 
dom, order of arrival being preserved, and 
the queue being then served in groups. The 
mean and variance of the queue length 
and the average waiting time are derived. A 
crude but useful inequality for the average 
waiting time is established, with a sug- 
gested application to hospital out-patient 
clinics. R. I. Taylor, Virginia Polytechnic 
Institute. 


Basu, D., “On the optimum character of 
some estimators used in multistage sam- 
pling problems,” Sankhya, 13 (1954), 363- 
68. 


Three different methods of selecting 
primary units from strata are discussed: 
(1) primary units chosen with replacement 
and with different probabilities; (2) primary 
units chosen without replacement and with 
equal probabilities; and (3) primary units 
chosen with replacement, using the first 
estimate for a primary unit r times if that 
unit occurs r times. For the three methods, 
unbiased estimators are simply deduced 
under general conditions, and these 
estimators are shown to be the “best” 
within a class of estimators. For the first 
and last methods, the problem of selecting 
the probabilities of selection of given units 
is studied and optimum theoretical 
solutions are obtained. Paul N. Somerville, 
Virginia Polytechnic Institute. 


Basu, D., and Laha, R. G., “On some char- 
acterization of the normal distribution,” 
Sankhya, 13 (1954), 359-62. 


Geary’s theorem is extended by proving 
that if the sample mean is distributed 
independently of any $k_r$ ($r \ge 2$) (where 
$k_r$ is the unbiased estimator, as given by 
Fisher, of the rth population cumulant 
$\kappa_r$), then the parent population is 
normal. T. S. Russell, 
Virginia Polytechnic Institute. 


Box, G. E. P., “The exploration and ex- 
ploitation of response surfaces: Some gen- 
eral considerations and examples,” Bio- 
metrics, 10 (1954), 16-59. 


When the experimenter has several 
quantitative variables, the levels of which 
he may control in an effort to maximize 
yield or minimize cost, classical methods 
are not particularly apropos. The basic 
idea is that response Y may be representable 
as a second degree function of the 
quantitative factors $x_1, x_2, \ldots, x_k$. The 
coefficients can be estimated only if the 
experimental combinations are suitably 
chosen. The interpretation of the fitted 
response function is greatly facilitated by 
reduction of the general quadratic obtained 
to canonical form in variables 
$X_1, X_2, \ldots, X_k$, where $X_1, \ldots, X_k$ 
are (after choice of origin) obtained by 
orthogonal transformation of 
$x_1, x_2, \ldots, x_k$. Interpretation of the 
response function in terms of 
$X_1, \ldots, X_k$ may lead to recognizing 
factor combinations which provide higher 
yield, or to identifying factor combinations 
of equivalent yield but varying cost or 
convenience. 

The principles are illustrated with worked 
examples including good figures and tables. 
L. E. Moses, Stanford University. 
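
The reduction to canonical form described in this abstract can be illustrated for two factors. The following is only a sketch, not the paper's own computation: the function name and the coefficient names `b0`, `b1`, `b2`, `b11`, `b22`, `b12` are assumptions introduced here, and the fitted surface is assumed to be $y = b_0 + b_1 x_1 + b_2 x_2 + b_{11} x_1^2 + b_{22} x_2^2 + b_{12} x_1 x_2$. The stationary point is taken as the new origin, and the canonical coefficients are the eigenvalues of the symmetric matrix of second-order terms.

```python
import math

def canonical_form(b0, b1, b2, b11, b22, b12):
    """Reduce a fitted two-factor quadratic to y = ys + l1*X1^2 + l2*X2^2.

    Coefficient names are illustrative assumptions, not the paper's notation.
    Returns the stationary point, the response there, and the two
    canonical coefficients (eigenvalues of the second-order matrix).
    """
    # Gradient = 0 gives the linear system
    #   [2*b11  b12 ] [x1]   [-b1]
    #   [ b12  2*b22] [x2] = [-b2]
    det = 4.0 * b11 * b22 - b12 * b12
    if det == 0.0:
        raise ValueError("no unique stationary point")
    xs1 = (-2.0 * b22 * b1 + b12 * b2) / det
    xs2 = (b12 * b1 - 2.0 * b11 * b2) / det
    # Response at the stationary point.
    ys = b0 + 0.5 * (b1 * xs1 + b2 * xs2)
    # Eigenvalues of [[b11, b12/2], [b12/2, b22]] in closed form.
    mean = 0.5 * (b11 + b22)
    radius = math.sqrt((0.5 * (b11 - b22)) ** 2 + (0.5 * b12) ** 2)
    return (xs1, xs2), ys, (mean + radius, mean - radius)
```

If both canonical coefficients are negative the stationary point is a maximum; coefficients of mixed sign indicate a saddle, and a coefficient near zero indicates a ridge along which yield is nearly constant.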


Box, G. E. P., and Hunter, J. S., “A con- 
fidence region for the solution of a set of 
simultaneous equations with an application 
to experimental design,” Biometrika, 41 
(1954), 190-99. 
The authors extend Fieller’s theorem to 
determine an exact confidence region for 
the solution to a set of k linear equations. 
The confidence region depends on (1) the 
magnitude of the errors in estimating the 
coefficients in the equations and (2) the 
state of the conditioning of the equations. 
A poorly conditioned set of equations is one 
in which one or more of the equations are 
almost linearly dependent on other equa- 
tions. An example is presented of a well and 
a poorly conditioned pair of equations. An 
equation for the boundary of the confidence 
region is given and illustrated for an em- 
pirically determined stationary point on a 
fitted quadric surface with two independent 
variables. This procedure is generalized to 
confidence limits for a stationary point on a 
surface represented by any equation linear 
in the coefficients. G. I. Pau, North 
Carolina State College. 


Broadbent, S. R., and Kendall, David G., 
“The random walk of Trichostrongylus re- 
tortaeformis,” Biometrics, 9 (1953), 460-66. 


The life cycle of an intestinal parasite of 
sheep or rabbits, Trichostrongylus re- 
tortaeformis, involves a phase in which the 
larva wanders apparently at random until 
he climbs a blade of grass, where he re- 
mains until eaten. The statistical problem 
treated is to find the distribution of larvae 
thus trapped on blades of grass. It is as- 
sumed that the larva’s motion until trap- 
ping is a random walk of the Brownian 
movement type; two hypotheses defining 
the probability of being trapped are con- 
sidered. Both models are analyzed. Some 
empirical data suggest that the Brownian 
motion assumption may not be justified. 
L. E. Moses, Stanford University. 


Bryan, W. Ray, “The relative precision of 
dose response data in the virus and tumor 
fields as compared with that in certain other 
fields of biology,” Journal of the National 
Cancer Institute, 15 (1954), 305 ff. 

A commonly used “index of precision” 
for a biological assay technique is defined 
as $\lambda = \sigma/\beta$, where $\sigma$ is the 
standard deviation of the response 
metameter around the (straight line) 
regression of that metameter on log dose, 
and $\beta$ is the slope of the regression 
line. The usefulness of the index is 
commented on. A large number of bioassays 
of many different kinds of biologically 
active agents were reviewed; from these 
studies characteristic ranges of values for 
$\lambda$ are presented in summary form, 
classified by type of agent. L. E. Moses, 
Stanford University. 
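
As a rough illustration of how the index of precision might be estimated from assay data, the sketch below fits a straight line of response on log dose by ordinary least squares and returns the ratio of the residual standard deviation to the slope. The function name is hypothetical, the use of base-10 logarithms and of n - 2 residual degrees of freedom are assumptions made here, and the absolute value of the slope is taken so that the index is positive.

```python
import math

def index_of_precision(doses, responses):
    """Estimate lambda = s / |b| from a straight-line fit on log10(dose).

    s is the residual standard deviation (n - 2 degrees of freedom)
    and b is the fitted slope. Names and conventions are illustrative.
    """
    x = [math.log10(d) for d in doses]
    n = len(x)
    if n < 3:
        raise ValueError("need at least three dose levels")
    mean_x = sum(x) / n
    mean_y = sum(responses) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, responses))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, responses))
    s = math.sqrt(rss / (n - 2))
    return s / abs(slope), s, slope
```

A smaller value indicates a more precise assay: the same scatter about the line translates into a smaller error in estimated log dose when the slope is steep.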
Chanda, K. C., “A note on the consistency 
and maxima of the roots of likelihood 
equations,” Biometrika, 41 (1954), 56-61. 


The consistency of maximum likelihood 
estimates for a k-parameter system is 
proven by use of assumptions which are 
different from, and claimed to be possibly 
stronger than, those of Wald. R. L. Anderson, 
North Carolina State College. 


Claringbold, P. J., Biggers, J. D., and 
Emmens, C. W., “The angular transforma- 
tion in quantal analysis,” Biometrics, 9 
(1953), 467-84. 

One of several transformations alterna- 
tive to the probit transformation is given by 
$\phi(p) = \sin^{-1}\sqrt{p}$. If for p the 
observed fraction is used, a quick 
non-iterative solution is obtained. If for p 
the expected fraction of response is used, 
an iterative process gives the maximum 
likelihood solution. The two methods were 
both used on several experiments and 
differences in results were small. Little or 
no information inheres in zero or hundred 
per cent responses. Choice of suitable 
experimental design will reduce 
inefficiency from this source. A 
parallelogram design for factorial 
experiments with quantal response is 
introduced and illustrated. 
L. E. Moses, Stanford University. 
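
The quick non-iterative solution mentioned above can be sketched as a weighted straight-line fit of the transformed observed fractions on log dose. This is an illustration under stated assumptions rather than the authors' worked procedure: the function name is hypothetical, angles are in radians, and each group is weighted by its size because the variance of $\sin^{-1}\sqrt{p}$ is approximately 1/(4n) whatever the value of p.

```python
import math

def angular_fit(log_doses, responders, group_sizes):
    """Weighted least-squares line through arcsin(sqrt(observed fraction)).

    Weights are the group sizes, since Var[arcsin(sqrt(p))] ~ 1/(4n).
    Returns (intercept, slope, log dose giving 50 per cent response).
    """
    phi = [math.asin(math.sqrt(r / n)) for r, n in zip(responders, group_sizes)]
    w_total = float(sum(group_sizes))
    mean_x = sum(n * x for n, x in zip(group_sizes, log_doses)) / w_total
    mean_phi = sum(n * f for n, f in zip(group_sizes, phi)) / w_total
    sxx = sum(n * (x - mean_x) ** 2 for n, x in zip(group_sizes, log_doses))
    sxy = sum(n * (x - mean_x) * (f - mean_phi)
              for n, x, f in zip(group_sizes, log_doses, phi))
    slope = sxy / sxx
    intercept = mean_phi - slope * mean_x
    # A 50 per cent response corresponds to an angle of pi/4.
    log_ed50 = (math.pi / 4.0 - intercept) / slope
    return intercept, slope, log_ed50
```

Because the weights do not depend on the fitted response, no iteration is needed, which is the practical attraction of the transformation.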


Cochran, William, and Carroll, Sarah 
Porter, “A sampling investigation of the 
efficiency of weighting inversely as the 
estimated variance,” Biometrics, 9 (1953), 
447-59. 

Let $x_i$ ($i = 1, \ldots, k$) be normally and 
independently distributed around a common 
$\mu$, but with different variances 
$\sigma_i^2$. If the $\sigma_i^2$ are known, then 
the best estimate of $\mu$ is 

$$\bar{x} = \sum_{i=1}^{k} \frac{x_i}{\sigma_i^2} \Big/ \sum_{i=1}^{k} \frac{1}{\sigma_i^2}.$$

If the $\sigma_i^2$ are unknown, but estimates 
$s_i^2$ with $n_i$ degrees of freedom are in 
hand, one might estimate $\mu$ by 

$$\bar{x}_s = \sum_{i=1}^{k} \frac{x_i}{s_i^2} \Big/ \sum_{i=1}^{k} \frac{1}{s_i^2}.$$

The variance of $\bar{x}_s$ (it is unbiased) is 
studied for various values of k and n 
(taking all $n_i = n$). Sampling methods afford 
most of the results. The ratio 
$\sigma_{\bar{x}_s}^2/\sigma_{\bar{x}}^2$ increases 
with k. Comparison is made with a 
formula of Meier for estimating 
$\sigma_{\bar{x}_s}$, and an empirical 
adjustment to it is proposed. L. E. 
Moses, Stanford University. 
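
A small sampling experiment of the kind this abstract describes might look as follows. It is a sketch only: the function names, the choice of a zero common mean, and the default number of replications are assumptions made for illustration and are not taken from the paper. Each replicate draws one observation per group together with an independent variance estimate on n degrees of freedom, then compares the mean square error of the estimate that uses estimated weights with that of the estimate that uses the true weights.

```python
import random

def weighted_mean(x, variances):
    """Mean of x weighted inversely as the supplied variances."""
    weights = [1.0 / v for v in variances]
    return sum(w * xi for w, xi in zip(weights, x)) / sum(weights)

def variance_ratio(true_vars, n, reps=4000, seed=1):
    """Monte Carlo ratio: Var(estimated weights) / Var(known weights).

    The common mean is set to 0, so the mean square of each estimate
    serves as its variance. All settings here are illustrative.
    """
    rng = random.Random(seed)
    known, estimated = [], []
    for _ in range(reps):
        x = [rng.gauss(0.0, v ** 0.5) for v in true_vars]
        # s_i^2 = sigma_i^2 * chi-square(n) / n, independent of x_i.
        s2 = [v * sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n
              for v in true_vars]
        known.append(weighted_mean(x, true_vars))
        estimated.append(weighted_mean(x, s2))
    mean_square = lambda vals: sum(v * v for v in vals) / len(vals)
    return mean_square(estimated) / mean_square(known)
```

Calling `variance_ratio([1.0, 2.0, 3.0, 4.0, 5.0], n=4)` gives a ratio above one, reflecting the loss from having to estimate the weights; repeating the call with more groups is one way to see the dependence on k that the abstract reports.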


Cohen, A. C., and Woodward, John, “Tables 
of Pearson-Lee-Fisher functions of singly 
truncated normal distributions,” Bio- 
metrics, 9 (1953), 489-97. 


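The estimation of $\mu$ and $\sigma$ in the 
frequency function 

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}\,\left[1 - \Phi\!\left((x_0 - \mu)/\sigma\right)\right]}\; e^{-\frac{1}{2}\left((x - \mu)/\sigma\right)^2}, \qquad x_0 \le x < \infty,$$

where $\Phi$ is the standard normal 
distribution function, 
by the method of maximum likelihood leads 
to the solution of awkward transcendental 
equations. Tables are given which eliminate 
nearly all the computational labor beyond 
calculating $\sum x$ and $\sum x^2$, the 
sufficient statistics. L. E. Moses, Stanford 
University. 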





Coombs, C. C., “A method for the study of 
interstimulus similarity,” Psychometrika, 19 
(1954), 183-94. 

Various methods for constructing uni- 
dimensional scales along a psychological 
dimension are discussed. The methods that 
Coombs presents in this article are partic- 
ularly appropriate for the development of 
metric scales peculiar to individuals and for 
testing whether different individuals have 
the same latent structure for a given set of 
stimuli. Information provided by several of 
the methods for collecting data to be scaled 
is evaluated; the weaknesses and strengths 
of each of the proposed methods are com- 
pared. In addition to evaluating standard 
methods for scaling, Coombs introduces the 
method of similarities and the method of 
cartwheels; the utility of variations of 
these latter methods is also pointed out. 
B. J. Winer, Purdue University. 


Dixon, W. J., “Power under normality of 
several nonparametric tests,” Annals of 
Mathematical Statistics, 25 (1954), 610-16. 


The author computed the power of four 
nonparametric tests (rank-sum, maximum 
deviation, median, and total number of 
runs) for the difference in means of two 
small samples ($N_1 = N_2 \le 5$) drawn from 
normal populations with equal variance, 
against alternatives $|\mu_1 - \mu_2|/\sigma$, 
at small levels of significance. He used the 
power efficiency function to compare these 
with the t-test and found that the four 
nonparametric tests have high power 
efficiencies. The power efficiency decreases 
slightly for more distant alternatives. As 
the level of significance increases, it 
increases slightly for the rank sum test, 
while that of the median and maximum 
deviation tests decreases. The author also 
calculated the power efficiency for the 
tests randomized to a single level of 
significance $\alpha = .025$ to make the 
comparison simpler. From this he found 
that the rank sum test has greater power 
than the median and maximum deviation 
tests. Furthermore, he calculated the 
limiting power efficiencies for the rank sum 
test for $N_1, N_2 \le 5$ and found that the 
local power efficiency for this test is 
greater than $3/\pi$, which is the limiting 
local power efficiency for large samples. 
A. E. SARHAN, 
University of North Carolina. 


Dunnett, C. W., and Sobel, Milton, “A 
bivariate generalization of Student’s ¢-dis- 
tribution with tables for certain special 
cases,” Biometrika, 41 (1954), 153-69. 

The authors consider the simultaneous 
distribution of two variates, $t_1 = z_1/s$ and 
$t_2 = z_2/s$. The $z_i$ follow a normal bivariate 
distribution with zero means, the same 
variance $\sigma^2$, and correlation $\rho$. The 
variance $\sigma^2$ is assumed independently 
estimated by $s^2$ with n degrees of freedom. 
The probability integral is 

$$\text{Prob}\{t_1 \le h,\ t_2 \le h\} = P.$$

Tables of P and h are presented for n = 1 (1) 
30 (3) 60 (15) 120, 150, 300, 600 and $\infty$, 
and $\rho = 0.5$ and $-0.5$. P is given to five 
decimal places for h = 0 (.25) 2.50 and 3.00, 
plus some additional values for larger h 
when n is small. h is given to three decimal 
places for P = .50, .75, .90, .95 and .99. An 
asymptotic expansion is derived for P and h: 

$$P = \sum_{i=0}^{4} A_i/n^i \quad \text{and} \quad h = \sum_{i=0}^{4} B_i/n^i.$$

Values of the $A_i$ and $B_i$ are presented for 
the same values of h and P mentioned 
above. This distribution has applications 
in certain multiple decision problems. R. L. 
ANDERSON, North Carolina State College. 
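
The probability integral defined in this abstract can be approximated by simulation, which gives a rough check on any tabulated value. The sketch below is not the authors' method (they tabulate the integral and derive an asymptotic expansion); the function name, the default number of replications, and the seed are assumptions made here. It draws correlated normal variates and an independent variance estimate with n degrees of freedom, taking $\sigma = 1$ without loss of generality.

```python
import math
import random

def bivariate_t_prob(h, n, rho, reps=20000, seed=0):
    """Monte Carlo estimate of Prob{t1 <= h, t2 <= h}.

    t_i = z_i / s, with (z1, z2) bivariate normal (unit variance,
    correlation rho) and s^2 an independent chi-square(n)/n estimate.
    """
    rng = random.Random(seed)
    root = math.sqrt(1.0 - rho * rho)
    hits = 0
    for _ in range(reps):
        u = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, 1.0)
        z1 = u
        z2 = rho * u + root * v  # correlation rho with z1
        s = math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n)
        if z1 <= h * s and z2 <= h * s:
            hits += 1
    return hits / reps
```

At h = 0 the value of s drops out and the probability reduces to the bivariate normal orthant probability $1/4 + \sin^{-1}(\rho)/(2\pi)$, which is 1/3 for $\rho = 0.5$ and 1/6 for $\rho = -0.5$; this gives a convenient check on the simulation.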


Ghurye, S. G., and Robbins, Herbert, “Two- 
stage procedures for estimating the differ- 
ence between means,” Biometrika, 41 
(1954), 146-52. 

Given two populations, $P_i$, with unknown 
means, $\theta_i$, and variances, $\sigma_i^2$. 
Samples of $n_i$ are obtained, at a total cost 
not to exceed $A_0$, to estimate 
$\theta = \theta_1 - \theta_2$. Assuming the cost 
is $a_1 n_1 + a_2 n_2 + a_3 \le A_0$, the minimum 
variance estimate is obtained when 
$n_i^0 = (A/a_i)\,\sqrt{a_i}\,\sigma_i/(\sqrt{a_1}\,\sigma_1 + \sqrt{a_2}\,\sigma_2)$, 
where $A = A_0 - a_3$. A two-stage sampling 
plan is considered when the $\sigma_i$ must be 
estimated from initial samples of $m_i$ from 
$P_i$. These estimates are used to determine 
the additional number of observations 
needed to estimate $\theta$ as above. For 
normal $P_i$, the variance of this estimate is 
compared with that obtained when the 
$\sigma_i$ are known, for total sample sizes 
N = 30 and 50; $m_1 = m_2 = m$ (m/N = .2, .3, 
.4); and $\sigma_2/\sigma_1$ = 1.00 (0.25) 3.00. In 
no case does the necessity of estimating 
the $\sigma_i$ increase the variance by 10%, and 
generally by much less. It is also shown 
that the ratio of the two variances is 
asymptotically unity for $P_i$ with finite 
$\sigma_i^2$ which also meet some other not 
very restrictive requirements. The usual 
sampling procedure of assuming 
$\sigma_1 = \sigma_2$ is asymptotically inferior 
to the two-stage procedure when 
$\sigma_1 < \sigma_2$. E. E. Sato, North 
Carolina State College. 
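
The optimum allocation quoted above is easy to evaluate directly. The following sketch assumes the standard deviations are known and treats the sample sizes as continuous; the function name is hypothetical, and in practice the results would be rounded to whole numbers.

```python
import math

def optimal_allocation(a1, a2, a3, total_cost, sigma1, sigma2):
    """Sample sizes minimizing Var(estimate of theta1 - theta2).

    Cost model: a1*n1 + a2*n2 + a3 <= total_cost (A_0 in the abstract).
    Returns the real-valued optimum (n1, n2) before rounding.
    """
    available = total_cost - a3  # A = A_0 - a_3
    denom = math.sqrt(a1) * sigma1 + math.sqrt(a2) * sigma2
    n1 = (available / a1) * math.sqrt(a1) * sigma1 / denom
    n2 = (available / a2) * math.sqrt(a2) * sigma2 / denom
    return n1, n2
```

With equal unit costs the allocation is proportional to the standard deviations, so `optimal_allocation(1, 1, 0, 40, 1, 3)` assigns 10 observations to the first population and 30 to the second. In the two-stage plan the unknown standard deviations would be replaced by estimates from the initial samples.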


Guest, P. G., “Grouping methods in the 
fitting of polynomials to equally spaced 
observations,” Biometrika, 41 (1954), 62-76. 

Given n equally spaced observations. The 
first part presents methods of fitting by use 
of N groups (n=rN), including estimated 
standard errors. Relative efficiencies are 
computed, including the case when some 
observations must be omitted because n is 
prime; the loss of efficiency in the latter 
case may be very serious, especially for 
high-degree coefficients. A second part is 
devoted to the use of step-functions. By use 
of double-step functions, high efficiency is 
obtained for second and third degree 
polynomials. The chief weakness of step- 
function fitting is the difficulty of estimat- 
ing standard errors and the degree of poly- 
nomial to be used; it is particularly useful 
when previous experience indicates a linear 
relationship is satisfactory. Comparative 
time studies are included of the various 
methods of fitting. R. L. Anderson, North 
Carolina State College. 


Jonckheere, A. R., “A distribution-free 
k-sample test against ordered alternatives,” 
Biometrika, 41 (1954), 133-45. 

By extending the definition of Kendall’s 
S statistic of rank correlation for two rank- 
ings to the case of k rankings, the author 
furnishes a test of the hypothesis that k 
samples were obtained from a single popula- 
tion against the alternative that they were 
drawn from a specific ordering of k popula- 
tions. The exact distribution of S is derived 
for the general case and is expressed in the 
form of a recursion formula for the special 
case in which the samples are of equal size. 





AMERICAN STATISTICAL ASSOCIATION JOURNAL, MARCH 1955 


Tables of Prob($S \ge S_0$) are given for 
certain combinations of $3 \le k \le 6$, 
$2 \le m \le 5$, and $0 \le S_0 \le 96$ for tests in 
which the k samples are all of size m. The 
author shows the limiting distribution to be 
normal when at least two of the k samples 
increase without bound and gives formulas 
for approximate tests using the normal and 
t distributions when samples are large. 
D. A. GARDINER, 
North Carolina State College. 
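
The statistic can be computed by direct counting. The sketch below assumes the usual pairwise form of S for samples listed in the hypothesized order, with no ties: for every pair of samples, count the observation pairs that agree with the ordering and subtract those that disagree. The function name is an assumption made here, and tied observations simply contribute zero.

```python
def jonckheere_s(samples):
    """S statistic for k samples listed in their hypothesized order.

    For each pair of samples (i before j), every pair (x, y) with
    x in sample i and y in sample j scores +1 if x < y, -1 if x > y.
    """
    s = 0
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            for x in samples[i]:
                for y in samples[j]:
                    if x < y:
                        s += 1
                    elif x > y:
                        s -= 1
    return s
```

For k samples each of size m the largest possible value is $m^2 k(k-1)/2$, attained when the samples are completely separated in the hypothesized order; large positive values of S therefore favour the ordered alternative.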


Mandel, L., “Grading with a gauge subject 
to random output fluctuations,” Journal of 
the Royal Statistical Society, Series B, 16 
(1954), 118-30. 

“A study is made of the errors arising in 
grading a normal population into classes 
with a gauge subject to output fluctuations. 
This involves evaluating the bivariate nor- 
mal integral when the correlation coefficient 
is close to unity and series for this are de- 
veloped. Curves are derived showing the 
dependence of the errors on the population 
variance and on the instrumental noise 
level. Means of reducing some errors by 
small adjustments of the gauging limits are 
also examined. The results have a direct 
bearing on the industrial applications of 
β-ray thickness gauges.” Hale C. Sweeny, 
Virginia Polytechnic Institute. 


Mann, H. B., “A theory of estimation for 
the fundamental random process and the 
Ornstein Uhlenbeck process,” Sankhya, 13, 
Part 4 (1954), 324-50. 

In Chapter 1, the author discusses a 
random process of the form 
$y_t = x_t + f(t)$, where $x_t$ is a Fundamental 
random process and f(t) a function 
satisfying given assumptions. Maximum 
likelihood estimates are derived for the 
variance constant of the Fundamental 
random process and the parameters of f(t). 
Together with a discussion of the optimum 
properties of these estimates, a test of 
significance is provided for testing the 
hypothesis f(t) = 0 against specific 
alternatives. In Chapter 2 the process $x_t$ 
is taken to be an Ornstein Uhlenbeck 
process depending on two parameters 
$\beta$ and $\sigma^2$. Methods of estimation are 
developed for $\beta$, $\sigma^2$, and the 
parameters in f(t). Variances and 
covariances are given for the estimates of 
the parameters of f(t). John E. Freund, 
Virginia Polytechnic Institute. 


Mann, Henry B., and Moranda, Paul B., 
“On the efficiency of the least square esti- 
mates of parameters in the Ornstein Uhlen- 
beck process,” Sankhya, 13, Part 4 (1954), 
351-58. 

Considering a random process of the form 
$y_t = x_t + f(t)$, where $x_t$ is an Ornstein 
Uhlenbeck process and the function f(t) is 
of a given form and depends on a set of 
parameters $k_i$, the limiting form of the 
maximum likelihood equations for the $k_i$ 
is derived under the assumption that 
$\beta$, one of the parameters of the 
Ornstein Uhlenbeck process, is known. The 
maximum likelihood estimates are 
subsequently compared with the 
corresponding least square estimates of 
the $k_i$ obtained without the assumption 
that $\beta$ is known. It is shown that these 
two sets of estimates are asymptotically of 
equal efficiency. John E. Freund, Virginia 
Polytechnic Institute. 


Masuyama, Motosaburo, “Analysis of the 
1939 model sample survey results from the 
viewpoint of integral geometry,” Sankhya, 
13, Part 3 (1954), 229-34. 

When sampling a geographical region, a 
frequent method of choosing sub-areas to 
be sampled is to place a square or grid on a 
map according to certain prescribed pro- 
cedures which need not concern us here. 
The sub-area chosen is then the area con- 
tained in the grid. The cost of running such 
a survey is largely dependent on the number 
of plots or fields contained in the grid since 
this is the number (p) of fields which must 
be enumerated. The author discusses a 
formula for estimating p in terms of the 
size of grid $a^2$, the total area covered by 
the survey in question T, the total area of 
plots $\phi$, the total length of their 
perimeters $\Lambda$, and the total number of 
plots in T, say $\nu$. The formula is: 
$p = \phi/T + (2\Lambda/\pi T)a + (\nu/T)a^2$. 
Since the first term on the right of this 
equation is less than one, we may neglect it 
and use the formula 
$p = (2\Lambda/\pi T)a + (\nu/T)a^2$. 
W. A. Thompson, Jr., Virginia Polytechnic 
Institute. 
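
The formula can be evaluated directly once the map quantities are measured. In the sketch below the function and argument names are assumptions made for illustration: `side` is the grid side a, `survey_area` is T, `plot_area` is the total area of plots, `perimeter` is the total length of plot perimeters, and `n_plots` is the number of plots in T. All lengths and areas must be in consistent units.

```python
import math

def expected_plots(side, survey_area, plot_area, perimeter, n_plots,
                   neglect_first_term=False):
    """Expected number of plots met by a square grid of the given side.

    Implements p = phi/T + (2*Lambda/(pi*T))*a + (nu/T)*a**2, optionally
    dropping the first term, which never exceeds one.
    """
    p = (2.0 * perimeter / (math.pi * survey_area)) * side \
        + (n_plots / survey_area) * side ** 2
    if not neglect_first_term:
        p += plot_area / survey_area
    return p
```

As a worked case, a region of 100 unit-square plots (T = 100, total plot area 100, total perimeter 400) sampled with a grid of side 1 gives $p = 1 + 8/\pi + 1 \approx 4.55$ plots per grid placement.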


Mathen, K. K., and Poti, S. J., “An adjust- 
ment for the effect of changing birth rates 
on infant mortality rates,” Sankhya, 13 
(1954), 417-22. 

A theoretical basis is established for esti- 
mating weights used in measuring the rela- 
tive mortality rate in a given year of babies 
born in that year to those born the previous 
year. It is shown that these weights are the 
same regardless of whether we assume a 
linear trend in the birth rate over the two 
year period or we assume a constant, but 
not necessarily equal, birth rate for each of 
the two years. The estimation process is 
illustrated using data obtained in India, the 
United Kingdom, and the United States. 
Richard G. Cornell, Virginia Polytechnic 
Institute. 
Mukherjee, M., “Estimation of national 
consumption of the United Kingdom from 
family budget studies,” Sankhya, 13 (1954), 
412-16. 

Estimates of the national consumption 
pattern from the consumption pattern as 
given by the family budget inquiries are 
compared with the official figures of the 
national consumption pattern. Logical ex- 
planations are given for the discrepancies 
between the estimates and the official 
figures. T. S. Russell, Virginia Polytechnic 
Institute. 


Page, E. S., “An improvement to Wald’s 
approximation for some properties of 
sequential tests,” Journal of the Royal 
Statistical Society, Series B, 16 (1954), 136- 
39. 

“A simple modification of the Wald ap- 
proximation to the operating characteristic 
and average sample number of a sequential 
test is given which provides estimates of 
these values for starting points of the test 
near the boundaries and an improved ap- 
proximation for general starting points.” 

Wald’s approximate formula for the op- 
erating characteristic of a sequential test is 
valid for starting points not near either 
boundary and for mean paths inclined at 
not more than a small angle to the bound- 
aries. The modification in this paper pro- 
vides a better approximation. R. A. 
Bradley, Virginia Polytechnic Institute. 


Page, E. S., “Control charts for the mean of 
a normal population,” Journal of the Royal 
Statistical Society, Series (B), 16 (1954), 
131-35. 

Discussed in this paper is a manner of 
choosing the sample size and control limit 
for control charts of the mean of some di- 
mension of a manufactured article. Tables 
for finding these values for most economical 
operation are included for the “one-sided” 
chart (a chart with a single control limit). 
Tables can also be used for the two-sided 
case. One set of tables minimizes the aver- 
age run length under undesirable conditions 
(more than a given deviation from the 
mean) for a given average run length under 
ideal conditions. The other set of tables 
maximizes the average run length under 
ideal conditions for a given average run 
length under undesirable conditions. R. I. 
Taylor, Virginia Polytechnic Institute. 


Patterson, H. D., “The errors of lattice 
sampling,” Journal of the Royal Statistical 
Society, Series (B), 16, Part 1 (1954), 140- 
49. 
This paper presents a scheme for using 
lattices as a pattern for sampling certain 
types of populations. The samples include 
equal numbers of units from each main 
class or from combinations of two, three, 
or more main classes in different 
classifications. Methods are given for 
estimation of the sampling variances from 
sample data. In general, estimates of errors 
can be obtained from samples consisting of 
special patterns of units or from random 
samples of several mutually exclusive sets, 
each of which itself constitutes a lattice 
sample. The author uses the following 
lattices in his sampling scheme: lattices 
with “n” groups, square lattices, cubic 
lattices, and rectangular lattices. The case 
of the square lattice is given in detail and 
rigorous proofs of the various formulas are 
given. For the other types of lattice, in 
general, simple expressions for the 
sampling errors are shown. Boyd 
Harshbarger, Virginia Polytechnic 
Institute. 


Pillai, K. C. S., and Ramachandran, K. V., 
“On the distribution of the ratio of the ith 
observation in an ordered sample from a 
normal population to an independent esti- 
mate of the standard deviation,” Annals of 
Mathematical Statistics, 25 (1954), 565-72. 


In problems dealing with the distribution 
of any order statistic from an N(0, 1) 
population, we run into the powers of the 
incomplete normal integral in the intervals 
$-\infty$ to x and 0 to x. The authors have 
given an expansion of the normal 
probability integral and its powers in the 
intervals $-\infty$ to x and 0 to x and have 
used the results to obtain the distributions 
of the studentized maximum modulus 
$u_n$ and the studentized extreme deviate 
$q_n$ from the population mean. Upper 5 per 
cent points of the studentized extreme 
deviate from the population mean and 
upper and lower 5 per cent points of $u_n$ 
are given for small sample sizes and for 
different d.f. of the sample standard 
deviation. Upper points of $q_n$ and $u_n$ are 
useful for simultaneous confidence interval 
estimation. Lower points of $u_n$ are useful 
for tests of normality against rectangular 
alternatives. A. E. Sarhan, University of 
North Carolina. 


Pimentel-Gomes, F., “The use of Mitscher- 
lich’s regression law in the analysis of ex- 
periments with fertilizers,” Biometrics, 9 
(1953), 498-516. 

Let Y be the yield at a level of fertilizer 
application X. Mitscherlich’s Law 
postulates that $E(Y) = \alpha + \beta\rho^X$, 
where $\alpha$, $\beta$, and $\rho$ are unknown 
parameters that have operational meaning 
in the context. If 4 or 5 equally spaced 
levels of X are used and a least squares fit 
is attempted, the resulting equations, 
though formidable, can be solved 
non-iteratively by the use of some tables 
given in the paper. Applications of the 
model are discussed. An example is 
worked. L. E. Moses, Stanford University. 
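
Without the paper's tables, a least squares fit of this model can still be obtained numerically. The sketch below swaps in a different technique from the tabular solution described above: it scans a grid of values of $\rho$ between 0 and 1 and, for each, fits $\alpha$ and $\beta$ by ordinary linear least squares, keeping the combination with the smallest residual sum of squares. The function name, the restriction to $0 < \rho < 1$, and the grid resolution are assumptions made here.

```python
def fit_mitscherlich(x, y, steps=1000):
    """Least squares fit of E(Y) = alpha + beta * rho**X by grid search.

    For each trial rho in (0, 1) the model is linear in alpha and beta,
    so those two are fitted in closed form. Returns (alpha, beta, rho).
    """
    n = len(x)
    mean_y = sum(y) / n
    best = None
    for i in range(1, steps):
        rho = i / steps
        z = [rho ** xi for xi in x]
        mean_z = sum(z) / n
        szz = sum((zi - mean_z) ** 2 for zi in z)
        if szz == 0.0:
            continue  # all levels identical; slope undefined
        beta = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y)) / szz
        alpha = mean_y - beta * mean_z
        sse = sum((yi - alpha - beta * zi) ** 2 for zi, yi in zip(z, y))
        if best is None or sse < best[0]:
            best = (sse, alpha, beta, rho)
    return best[1], best[2], best[3]
```

With yield rising toward a ceiling, $\alpha$ is the limiting yield and $\beta$ is negative. The grid gives $\rho$ only to the chosen resolution; a finer grid or a local refinement around the best value would sharpen it.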


Plackett, R. L., “The truncated Poisson 
distribution,” Biometrics, 9 (1953), 485- 
88. 

The Poisson distribution is defined by 
$p_r = \lambda^r e^{-\lambda}/r!$. Cases arise where 
no observations are available for r = 0. A 
method of unbiased estimation of 
$\theta(\lambda)$ (some function of the parameter) 
is proposed. Where $\theta(\lambda) = \lambda$ the 
method has high efficiency and advantages 
in computational ease. L. E. 
Moses, Stanford University. 
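
One simple estimator that is unbiased for $\lambda$ when the zero class is missing is the sum of all observations of 2 or more, divided by the total number of observations. Whether this is exactly the form proposed in the paper is an assumption made here; the sketch below computes it and also checks the unbiasedness numerically from the truncated distribution. The function names and the dictionary layout of the frequency table are illustrative.

```python
import math

def truncated_poisson_estimate(counts):
    """Estimate lambda from a zero-truncated sample.

    counts maps each observed value r >= 1 to its frequency n_r.
    The estimate is sum_{r >= 2} r * n_r / N, ignoring the ones.
    """
    total = sum(counts.values())
    return sum(r * n for r, n in counts.items() if r >= 2) / total

def expected_estimate(lam, max_r=200):
    """Expectation of the estimator under the truncated Poisson law."""
    p = math.exp(-lam)  # untruncated probability of r = 0
    acc = 0.0
    for r in range(1, max_r + 1):
        p *= lam / r  # untruncated probability of r
        if r >= 2:
            acc += r * p
    return acc / (1.0 - math.exp(-lam))
```

The expectation works out to $\lambda$ because $r\,p_r = \lambda\,p_{r-1}$, so the sum over $r \ge 2$ is $\lambda(1 - e^{-\lambda})$, which cancels the truncation factor. No iteration or table lookup is needed, which fits the computational ease the abstract mentions.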


Read, D. R., “The design of chemical experi- 
ments,” Biometrics, 10 (1954), 1-15. 

Many experimental programs in chem- 
istry have as their purpose to determine 
that combination of levels of controllable 
conditions which will conduce to maximize 
yield. Box and Wilson have given methods 
for designing and analysing such experi- 
ments. These methods are discussed in a 
clear and elementary way and an illustra- 
tive example is given. L. E. Moses, Stan- 
ford University. 


Ruben, H., “On the moments of order 
statistics in samples from normal popula- 
tions,” Biometrika, 41 (1954), 200-27. 

The moments of order statistics derived 
from normal populations, as well as the 
moment-generating function of the square 
of any order statistic, are shown to be 
closely related to the contents of the mem- 
bers of a class of hyperspherical simplices. 
Formulas for computing the contents of 
regular hyperspherical simplices and the 
corresponding moments of the extreme 
order statistics are derived. For samples of 
up to fifty items, selected surface contents 
are tabulated plus the first ten moments 
about the origin and the first four moments 
about the mean of extreme members to 
more decimal places than in previous pub- 
lications. The author hopes to compute by 
electronic means the relative contents of 
skew hyperspherical simplices and to apply 
these values for the derivation of at least 
the first four moments of order statistics 
which are not extreme. T. M. Kelleher, 
North Carolina State College. 
Rushton, S., “On the confluent hypergeometric 
function $M(\alpha, \gamma, x)$,” Sankhya, 13 
(1954), 369-76. 

Rushton, S., and Lang, E. D., “Tables of 
the confluent hypergeometric function,” 
Sankhya, 13 (1954), 377-411. 

Certain properties of the confluent hypergeometric 
function 

$$M(\alpha, \gamma, x) = 1 + \frac{\alpha}{\gamma}\,x + \frac{\alpha(\alpha+1)}{\gamma(\gamma+1)}\,\frac{x^2}{2!} + \frac{\alpha(\alpha+1)(\alpha+2)}{\gamma(\gamma+1)(\gamma+2)}\,\frac{x^3}{3!} + \cdots$$

are stated and reference is made to the different 
forms in which this function can be 
written. The important Kummer’s relation 
$M(\alpha, \gamma, x) = e^{x} M(\gamma - \alpha, \gamma, -x)$, some 
fundamental recurrence formulas and 
Whittaker’s form of the confluent hypergeometric 
function, $W_{k,m}(x)$, are mentioned. 
The application of $M(\alpha, \gamma, x)$ in sequential 
tests of composite hypotheses in the following 
cases is indicated: (a) the one-sided 
sequential t-test, (b) the sequential F-test, 
and (c) the two-sided sequential t-test as a 
special case of (b). 

The construction of tables of the confluent 
hypergeometric function and methods 
of extending such tables are considered 
in the paper. The tables published give 
values of $M(\alpha, \gamma, x)$ for $\gamma$ = 0.5, 1.0, 1.5, 2.0, 
2.5, 3.0, 3.5, and 4.5, x = .02(.02).10(.10)1.0(1.0)10, 
20, 30, 50, 100, 200 and a 
range of half-integer and/or integer values 
of $\alpha$ from 0.5 to approximately 50. The 
values of the function are given to 7 significant 
figures. These tables must be regarded 
as companion tables to those constructed by 
Nath (Sankhya 11 (1951)) giving values of 
the function for $\gamma$ = 3 and $\gamma$ = 4. D. E. W. 
Schumann, Virginia Polytechnic Institute. 
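
The series definition and Kummer's relation quoted above can be checked numerically with a few lines. This is a minimal sketch that sums the power series directly, which is adequate only for moderate values of x; the function name, tolerance, and term limit are assumptions, and nothing here reproduces how the published tables were actually computed.

```python
import math

def kummer_m(alpha, gamma, x, tol=1e-16, max_terms=500):
    """Sum the series 1 + (alpha/gamma) x + alpha(alpha+1)/(gamma(gamma+1)) x^2/2! + ..."""
    total = 1.0
    term = 1.0
    for k in range(max_terms):
        # Ratio of consecutive terms: (alpha + k) / (gamma + k) * x / (k + 1).
        term *= (alpha + k) / (gamma + k) * x / (k + 1)
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total

# Kummer's relation: M(alpha, gamma, x) = exp(x) * M(gamma - alpha, gamma, -x).
lhs = kummer_m(0.5, 1.5, 2.0)
rhs = math.exp(2.0) * kummer_m(1.5 - 0.5, 1.5, -2.0)
```

For large positive x the direct series is slow, and for large negative x it loses accuracy to cancellation, which is one reason Kummer's relation is useful in tabulation.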


Som, Ranjan K., “Seasonality in the in- 
cidence of strikes in the Bombay textile 
industry,” Sankhya, 13 (1954), 423-28. 


Monthly data from January 1943 to 
December 1951 on the number of strikes 
“started in each month” in the textile industry 
in Bombay state are analyzed, with the 
conclusion that there is a seasonal strike 
pattern. Methods used are “heuristic in nature.” 
First, paired t’s for all combinations 
of months are computed and $\chi^2$ used to 
test the agreement between the theoretical 
and observed t-distributions. Second, using 
a method attributed to A. Wald, effects due 
to trends are eliminated and an analysis of 
variance of seasonal variations performed. 
Paul N. Somerville, Virginia Polytechnic 
Institute. 


Steinhaus, H., “Quality control by samp- 
ling (a plea for Bayes’ Rule),” Colloquium 
Mathematicum, 2 (1950), 98-108. 

The author defends the “outmoded” 
thesis that in problems of sampling inspec- 
tion, and the like, it is not unwise to adopt 
Bayes’ rule, more precisely, to behave as 
though it were known a priori that lot 
qualities are uniformly distributed from 
zero to one. Three sorts of argument are 
adduced for this highly controversial thesis. 
According to a line of thought due to J. 
Oderfeld, “modern” procedures involve unrealistic 
hypotheses for their formal justification 
no less than does Bayes’ rule. In 
certain main problems, behavior based on 
Bayes’ rule differs little, either in its form 
or consequences, from behavior based on 
modern procedures. Several illustrations 
suggest that Bayes’ rule is more stimulating 
to the practical resolution of new problems 
than are “modern” procedures. L. J. 
Savage, University of Chicago. 
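
To make the uniform-prior assumption concrete, suppose (an assumption of this sketch, not a statement from the paper) that a sample of n items from a large lot shows k defectives, so that the count is binomial given the lot quality p. With p uniform on (0, 1) a priori, the posterior is Beta(k+1, n-k+1), whose mean is (k+1)/(n+2). For integer parameters its distribution function is a binomial tail sum, so only the standard library is needed. The function names are hypothetical.

```python
from math import comb

def posterior_mean(defectives, sample_size):
    """Posterior mean of the lot fraction defective under a uniform prior."""
    return (defectives + 1) / (sample_size + 2)

def posterior_prob_at_most(p0, defectives, sample_size):
    """P(p <= p0 | data) for the Beta(k+1, n-k+1) posterior.

    Uses the identity I_x(a, b) = P(Binomial(a+b-1, x) >= a) for integer a, b.
    """
    a = defectives + 1
    n = sample_size + 1
    return sum(comb(n, j) * p0 ** j * (1 - p0) ** (n - j) for j in range(a, n + 1))

# Two defectives in a sample of twenty.
mean_quality = posterior_mean(2, 20)
prob_below_quarter = posterior_prob_at_most(0.25, 2, 20)
```

With no data at all (k = n = 0) the function returns p0 itself, which is the uniform prior.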


Thurstone, L. L., “An analytical method for 
simple structure,” Psychometrika, 19 (1954), 
173-82. 


One of the most controversial phases of 
applied factor analysis concerns the rota- 
tion of the reference axes in the configura- 
tion of the variables. The concept of simple 
structure was proposed by Thurstone as a 
guiding principle, but this concept lacked a 
precise mathematical formulation. At- 
tempts have been made to give this con- 
cept such formulation; heretofore none of 
these attempts has led to computationally 
feasible techniques. 

In this paper Thurstone reformulates the 
concept of simple structure in mathematical 
terms and describes a practically feasible 
method for achieving simple structure 
analytically. In the past such solutions have 
been achieved largely by graphical methods. 
The essential feature of Thurstone’s 
analytical solution depends upon a criterion 
function 

$$\phi = \sum_j w_{jp} v_{jp}$$

where $v_{jp}$ is the scalar product of a vector j 
with a reference vector p, and $w_{jp}$ are 
weights assigned to $v_{jp}$ so as to give the 
criterion function certain optimum properties 
with respect to a reference vector p. 
The initial location of the reference vector 
p is made by inspection; weights are assigned 
in accordance with a predetermined 
set of values, such that vectors nearly 
orthogonal to p are given maximum weights 
and those vectors having large projections 
on p are given minimum weights. 
Successive adjustments are made in the 
initial location of the reference vector p 
until the criterion function $\phi$ assumes a 
minimum value. Detailed computational 
procedures for making initial and iterative 
adjustments are given; a numerical example 
is presented. The adaptation of this rotation 
method to IBM equipment is said to be in 
progress. B. J. Winer, Purdue University. 


Tsao, Chia Kuei, “An extension of Massey’s 
distribution of the maximum deviation between 
two-sample cumulative step functions,” 
Annals of Mathematical Statistics, 
25 (1954), 587-592. 

If we have two random samples (each 
with elements arranged in ascending order) 
of sizes n and m drawn from continuous 
distributions with cumulative functions 
F(x) and G(x) respectively, then we can use 
two statistics $d_r$ and $d_r'$ (where r represents 
the rth observation) devised by the author 
to test the hypothesis F(x)=G(x). He 
derived the distribution of these two statistics 
under the hypothesis F(x)=G(x) and 
tabulated their probabilities for m=2. He 
also showed that if r=m=n, the distributions 
of both $d_r$ and $d_r'$ reduce to Massey’s 
distribution and if r=1, $d_r$ reduces to a 
special case of the exceedance problem of 
Gumbel and von Schelling. He illustrated 
the use of these statistics with an example 
of censored samples (in life testing). He also 
showed for situations where the observations 
below certain ordered observations 
are missing, or if the observations are 
available in descending order, that the two 
statistics $D_r$ and $D_r'$ are to be used and their 
distributions are identical with those of $d_r$ 
and $d_r'$. A. E. Sarhan, University of North 
Carolina. 

BOOK REVIEWS 


The Economic Report of the President, U. S. Government Printing Office, 
January 1955. Pp. x, 203. $0.75. 


See the article by Beryl Wayne Sprinkel, pp. 240-248 in this issue. 


Standard of Living in India and Pakistan, 1931-32 to 1940-41. R. C. Desai. 
Bombay: Popular Book Depot. 1953. Pp. xvii, 286. Rs. 20. 


Morris David Morris, University of Washington 


During India’s colonial period more than a dozen attempts were made to 
compute the country’s national income, the most notable being Rao’s 
estimate for 1931-32. Since Independence the official National Income Com- 
mittee has worked out estimates for the years 1948-49 through 1950-51. 
Now R. C. Desai has made another attempt, this one to determine the level 
of consumer expenditures for the ten year period 1931-32 through 1940-41. 
His study is not strictly comparable with either of the others. Being a study 
of consumer expenditures, it is merely one step towards the determination of 
national income. At the same time the area covered is greater, including the 
whole of what is now India and Pakistan. 

Although India is statistically the best served of all underdeveloped areas, 
the scholar who attempts to use the vast masses of data finds that they crum- 
ble in his hands. One of the most striking features of the volume is that in his 
effort to construct his consumer expenditures estimates Desai has explored 
virtually all the available statistical material. I can recall no other single 
volume where the limitations and qualifications of the Indian data are more 
carefully described. While it is desirable to build up national income esti- 
mates independently from income, output, and expenditures data, Desai is 
forced to construct his tables largely on the basis of product estimates. In 
fact, what he calls consumer expenditures are mainly estimates of current 
product available for consumer purchase. 

The book is divided into three parts. The first part develops estimates of 
production for crops and crop products, livestock products and fish, textiles, 
and miscellaneous items taken by consumers. The second section works out 
estimates of quantities available for consumption. These estimates 
distinguish between the portions sold on the 
market and those distributed outside of the market mechanism. His analysis 
of the subsistence sector is intelligent and more subtle than one customarily 
finds. Nevertheless, his estimates of “gross village retention” of food cereals 
are not very different from the traditional assumption. The traditional estimates 
of retained food crops ran to about 60 per cent. Desai suggests that 
59 per cent of total rice production and 49 per cent of total wheat production 
did not enter the money sector. In this same section he estimates interregional 
and international movements of products. The third section brings 
him to grips with the problem of translating physical quantities into money 
value terms, including the problem of pricing the quantities that do not 
reach the market. Here he also estimates the money values of services. 

In his last two chapters he sums up the results of his laborious computa- 
tions, giving us the value of consumer expenditures for his ten year period in 
current prices and at constant (1938-39) prices. Desai concludes that con- 
sumer expenditures at current value work out to Rs. 82.5 per capita for 
1931-32. Rao’s national income estimate for the same year is Rs. 67.5 per 
capita which if adjusted to cover the same area as Desai’s would probably 
be about Rs. 60 per capita. Unless we assume that both private business and 
government expenditures were negative, which seems highly unlikely, Desai’s 
estimate is much higher than Rao’s. My own guess is that Desai’s estimate 
is closer to correct. 

Desai’s figures show that while consumer expenditures rose during the dec- 
ade, the rise was not sufficient to offset the increase in population. Per capita 
consumer expenditure declined by nearly 5 per cent during the period. He has 
no data to show whether private capital formation and government expenditures 
were growing rapidly enough to offset the fall in consumer expenditures, 
but his guess is that national income per capita was at least not rising. 

The least satisfying of Dr. Desai’s judgments stems from his attempt to 
calculate the reliability of his estimate. Using the same technique as Geary 
used for his work on the national income of Eire, he concludes that his estimate 
of consumer expenditures might err to the extent of ±7.2 per cent, 
which seems overly optimistic. In fact, using his technique it is impossible to 
tell what the degree of error is. Desai is modest enough, however, not to push 
this judgment. 

There are many weaknesses in the individual estimates with which read- 
ers will quarrel. Many of the estimates are purely conventional. Neverthe- 
less, his work is one of great merit, an impressive piece of cautious computa- 
tion in a field that is exasperatingly intractable. 


Workbook in Business Statistics, Third Edition. Louis F. Hampel. Homewood, 
Illinois: Richard D. Irwin Inc., 1953. (Pages are unnumbered.) $3.50. Paper. 


George Horwich, Indiana University 


Here is a collection of 177 exercises for the beginning student in statistics—or 
rather, descriptive statistics, since very little of statistical inference, 
in number or quality, is offered. The problems are aimed at business 
and economics students, all of the examples being drawn from that area. 
This is doubtless a limitation in a good course in statistical methodology, but 
partially compensating is the fact that the data contained in these exercises 
are often interesting on their own account. 

The workbook covers the traditional topics: numbers and ratios, charts 
and tables, investigations (statistical), frequency distributions and measures 
thereof, index numbers, time series, correlation, the normal curve, reliability 
and significance. Hampel is particularly strong in the first three topics, and 
altogether adequate in the rest, given the above qualifications and several 
more to follow. The problems can be handled rather well without the use of 
desk calculators, and this feature will appeal to instructors with many stu- 
dents and few machines. Some of the exercises are long; others are suited 
for class discussion. A few of the problems duplicate the essence of others, 
permitting alternate assignment in successive years. This should contribute 
to the useful life of the book. 

The reviewer keenly missed problems built around the following measures: 
an index of inequality in connection with the Lorenz curve; the coefficient of 
variation in its use for comparing distributions of the same variable, but different 
means (Hampel applies it only to distributions of different variables); 
Paasche-type index numbers; index numbers of diverse quantities. Perhaps 
only the last is a serious omission, and this attests to Hampel’s extensive 
coverage of descriptive topics. One feels, however, that problems of a more 
purely geometrical formulation would have been a useful addition to the 
student’s interpretive experience. For example, exercises might require the 
student to analyze frequency polygons (e.g., to pair ogives with their corre- 
sponding non-cumulative curves) or time series in which more or less of the 
various components are present. 

The exercises (168, 169, 170) dealing with the normal curve involve empiri- 
cal distributions which are not quite normal. This is unfortunate, since the 
main relevance of that curve is in sampling theory where the mathematical 
ideal prevails. Hampel has a lot to say about “representative” samples (35, 
36, 38), but there is not much reference to, or distinction from, random sam- 
ples and related issues. The analysis of variance is called for in exercises 176 
and 177, but the samples in the latter exercise will not pass a test for homo- 
scedasticity (at the 1% level). Moreover, variance analysis would seem to 
be a doubtful technique to impose on beginning students. 

What the workbook does, indeed, require is competent problems convey- 
ing the basic theory of testing and estimation. The use of normal and t-tables 
in connection with inferences as to population means is probably the easiest 
way to accomplish this in a first course. An alternative method may be through 
problems concerning the maximum value of a uniform population. In this 
case, the student must have some command of the basic probability laws. 
But whatever the technique, this reviewer will recommend the use of Ham- 
pel’s exercises only if the instructor supplements the workbook with testing 
and estimation problems of his own or someone else’s making. 

A solutions manual accompanies the workbook. The arithmetic appears 
to be accurate, with one exception, and that is in problem 177 where the 
sums of squares are incorrectly computed. In addition, sample size, instead 
of degrees of freedom, is used in that exercise in computing variance. This is 
also done in other problems where the standard error of estimate is involved. 
Hampel seems to be aware of what he is doing, since he labels his solutions, 
“large sample technique.” But this is a vague procedure, and no substitute 
for using degrees of freedom, which are appropriate for large and small sam- 
ples alike. In the answer to problem 36 the author disregards the variability 
within strata as a consideration in allocating the total sample. In problem 
174 he assumes the binomial variance to be stabilized in a problem (not fully 
defined) of determining sample size. There are other questionable answers 
and assertions, but these are the major ones. 
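
The difference between dividing by sample size and dividing by degrees of freedom is easy to see on a small sample. The sketch below uses made-up numbers (not data from the workbook) and the standard library's `statistics` module, where `pvariance` divides by n and `variance` by n-1.

```python
import statistics

# Hypothetical four-observation sample, chosen only for easy arithmetic.
sample = [4.0, 7.0, 9.0, 10.0]

# "Large sample technique": sum of squared deviations divided by n.
variance_divisor_n = statistics.pvariance(sample)

# Degrees-of-freedom estimate: the same sum divided by n - 1.
variance_divisor_df = statistics.variance(sample)

# Ratio of the two estimates is (n - 1) / n, here 0.75.
ratio = variance_divisor_n / variance_divisor_df
```

With n = 4 the "large sample" figure is a quarter smaller than the unbiased one; the gap shrinks as n grows, which is why the shortcut misleads mainly in small samples.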

The workbook is, incidentally, in its third edition. The present edition 
differs from the previous one (1947) in the addition of 27 problems, the re- 
placement of several others, and a modernization of the already mentioned 
data. 


Pictographs and Graphs. Rudolf Modley and Dyno Lowenstein. New York: 
Harper and Brothers, 1952. Pp. 186. $4.00. 


Kenneth W. Haemer, American Telephone and Telegraph Company 


This is a revision and expansion of Dr. Modley’s earlier book How to Use 
Pictorial Statistics published in 1937 and now out of print. The present book 
is an excellent reference and source book for anyone who uses pictographs; 
it should also be a helpful book for anyone who is in doubt about whether to 
use them or not. 

The authors not only describe how to plan and design pictograph charts 
and diagrams, but, perhaps even more important, explain why and when this 
method of presentation is suitable and when not. Throughout the book, 
they have included frequent reminders that pictorial presentation is not a 
substitute for accurate analysis and sharply focused comparisons. They are 
alert to the three major mistakes in using this method—using pictures to 
disguise weak data, using them ineptly, and using them for the wrong au- 
dience—and carefully point out how to avoid them. 

The content is divided into twelve chapters: seven about pictograph 
methods, two about conventional charts, one about sources and uses of 
statistical data, and two about production and reproduction methods. 

The chapters explaining what pictographs are, how to design them, and 
how to develop pictograph charts, are the best information on these subjects 
that this reviewer has seen. A fourth chapter, illustrating other pictograph 
uses, should be a valuable aid in using pictographs for schematic and 
diagrammatic presentation. The usage chapters, on who uses pictographs 
and where, provide a good coverage of this subject, and include many good 
examples of pictograph presentation put to work. 

The section on sources and uses of statistical data is rather brief, and 
apparently intended for students and other beginners. So are the two chap- 
ters on “conventional” charts. The first of these, a collection of familiar 
curve, surface, and bar charts, includes only about half of the standard 
chart types, but adequately serves the authors’ purpose: to demonstrate 
that many data are not appropriate for pictograph presentation. The chapter 
on “Cheating with Charts”—contributed by a guest author, Frederick 
Jahnel—is brief but pithy: it points out several basic misuses of graphic 
presentation—both pictorial and conventional—and shows the serious effect 
these may have on the reader’s interpretation of the data. 

The authors may have done the reader one possible disservice in making 
pictograph presentation seem almost easy. The fact is that this type of chart 
is harder to do well than a conventional chart. If the advice and ideas in this 
book are followed, it will be of great help in producing successful pictograph 
charts and diagrams; but no one is going to turn out pictorial graphics as 
expert as these until he has gained some of the authors’ experience and 
know-how. 


Statistical Presentation. John H. Myers. Paterson, N. J.: Littlefield, Adams & 
Co., 1950. Pp. 68. $.75. Paper. 


Kenneth W. Haemer, American Telephone and Telegraph Company 


This title in the Littlefield college outline series is a good synopsis of how 
to present data in tabular and graphic form. 

The author’s statement that “This book provides, in simple language, a 
basic guide to the various devices that have been found helpful in practice” 
is a good description of its content and style. Because it is brief, the book 
omits many details; and specialists in this field may feel that there are a 
great many more things that could be said. There are; but the author has 
managed to include most of the key ideas needed for a general introduction 
to the subject. 

Professor Myers makes several valuable points in this small book: in 
addition to illustrating and discussing the main forms of presentation, he 
emphasizes the importance of choosing the form of presentation best suited 
to the purpose. This underlines the need for clearly defining what the exact 
purpose of the presentation is. Failure to do this produces as many poor 
charts and tables as an insufficient knowledge of presentation methods. 

The author points out the difference between charts for presentation and 
charts for analysis or computation, emphasizes the importance of presenting 
an accurate picture of the data, and explains the more common types of 
presentation errors. The use of tabular methods for analysis is not mentioned, 
but the two basic kinds of tabular presentation are summarized and the 
major details of tabular design reviewed. 

The author says nothing about the presentation of statistical information 
in words. Yet, there are some important things that could be said on this 
subject, even in a brief treatise such as this. Perhaps in the next revised 
edition of this book, Professor Myers will include some discussion of how to 
present statistical information in text form. 

The publishers describe this book as “a two-color outline.” The second 
color, green, is used somewhat more for decoration than utility, which seems 
to be a waste. In addition to using the color for major topical headings, they 
might well have used it in the charts and tables for emphasis and clarity. 

In general, this book provides a sound introduction to statistical presenta- 
tion. It should be a good supplementary text for students, and a reliable 
reference for business and professional workers in statistics who are inexperienced 
in presentation or whose experience has been limited to a narrow 
field. 


Biometrika Tables for Statisticians, Volume I. Edited by E. S. Pearson and 
H. O. Hartley. Cambridge and New York: Cambridge University Press, 1954. 

“A complete recasting of the two volumes of Tables for Statisticians 
and Biometricians (1914, 1931) has been undertaken by Professor 
E. S. Pearson and Dr. H. O. Hartley,” says an announcement of these important 
tables. “Volume I of the new series contains 12 of the most commonly 
used tables from the earlier volumes, 26 tables published subsequently 
in Biometrika, mostly since 1940, and 16 tables freshly compiled or drawn 
from other sources. The combination represents a selection of tables most 
often needed by statisticians and experimentalists in the analysis of their 
data. In the case of certain of the more fundamental items rather more ex- 
tensive tabulation than is normally required has been carried out, so as to 
provide the accuracy needed by those concerned with mathematical develop- 
ments. 

“The 54 tables are preceded by a substantial [102 page] Introduction. This 
gives definitions of the functions tabulated, some account of methods of 
interpolation, where required, and many illustrations of the use of the tables. 
In the last connection, fuller accounts are provided for the more specialized 
and more novel tables than for the standard ones whose applications are 
well known.” 

The following tables, grouped into 6 classes, make up the volume: 

I. Tables of the Normal Probability Function. 1. The integral P(X) and 
ordinate Z(X) in terms of the standardized deviate X. 2. Values of 
$-\log Q(X) = -\log\{1 - P(X)\}$ for large values of X (Extension of Table 1). 
3. Values of X for extreme values of Q and P (Extension of Table 4). 4. Values 
of X in terms of Q and P. 5. Values of Z in terms of Q and P. 6. Table for 
probit analysis. 

II. Basic Tables Derived from the Normal Function. 7. Probability integral 
of the $\chi^2$-distribution and the cumulative sum of the Poisson distribution. 
8. Percentage points of the $\chi^2$-distribution. 9. Probability integral, $P(t \mid \nu)$, of 
the t-distribution. 10. Chart for determining the power function of the t-test. 
11. Test for comparisons involving two variances which must be separately 
estimated. 12. Percentage points of the t-distribution. 13. Percentage points 
for the distribution of the correlation coefficient, r, when $\rho = 0$. 14. The 
z-transformation of the correlation coefficient, $z = \tanh^{-1} r$. 15. Charts giving 
confidence limits for the population correlation coefficient, $\rho$, given the 
sample coefficient, r. Confidence coefficients 0.95 and 0.99. 16. Percentage 
points of the B-distribution. 17. Chart for determining the probability level 
of the incomplete B-function, $I_x(a, b)$. 18. Percentage points of the F-distribution 
(variance ratio). 19. Percentage points of the largest variance ratio, 
$s_{\max}^2/s_0^2$. 

III. Further Tables of Probability Integrals, Percentage Points, etc., of Distributions 
Derived from the Normal Function. 20. Moment constants of the 
mean deviation and of the range. 21. Percentage points of the distribution 
of the mean deviation. 22. Percentage points of the distribution of the range. 
23. Probability integral of the range, W, in normal samples of size n. 24. Percentage 
points of the extreme standardized deviate from population mean, 
$(x_n - \mu)/\sigma$ or $(\mu - x_1)/\sigma$. 25. Percentage points of the extreme standardized 
deviate from sample mean, $(x_n - \bar{x})/\sigma$ or $(\bar{x} - x_1)/\sigma$. 26. Percentage points of 
the extreme studentized deviate from sample mean, $(x_n - \bar{x})/s_\nu$ or $(\bar{x} - x_1)/s_\nu$. 
27. Mean range in normal samples of size n. 28. Mean positions of ranked 
normal deviates (normal order statistics). 29. Percentage points of the 
studentized range, $q = (x_n - x_1)/s_\nu$. 30. Tables for analysis of variance based 
on range. 31. Percentage points of the ratio, $s_{\max}^2/s_{\min}^2$. 32. Test for heterogeneity 
of variance: percentage points of M. 33. Test for heterogeneity of 
variance: table to facilitate interpolation in Table 32. 34. Tests for departure 
from normality. A. Percentage points of the distribution of a = (mean deviation)/(standard 
deviation). B. Percentage points of the distribution of 
$\sqrt{b_1} = m_3/m_2^{3/2}$. C. Percentage points of the distribution of $b_2 = m_4/m_2^2$. 35. Moments 
of $s/\sigma = \chi/\sqrt{\nu}$ and factors for determining confidence limits for $\sigma$. 

IV. Tables Relating to Certain Discrete Distributions. 36. Test for the sig- 
nificance of the difference between two Poisson variables. 37. Individual 
terms of certain binomial distributions: 

$$f(i \mid n, p) = \binom{n}{i} p^i q^{n-i}, \qquad q = 1 - p.$$

38. Significance tests in a 2×2 contingency table. 39. Individual terms, 
$e^{-m} m^i / i!$, of the Poisson distribution. 40. Confidence limits for the expectation 
of a Poisson variable. 41. Charts providing confidence limits for p in binomial 
sampling, given a sample fraction c/n. Confidence coefficients, 0.95 and 0.99. 

V. Miscellaneous Tables (Pearson type curves, rank correlation, orthogonal 
polynomials). 42. Percentage points of Pearson curves, for given $\beta_1$, $\beta_2$, expressed 
in standardized measure. 43. Chart relating the type of Pearson frequency 
curve to the values of $\beta_1$, $\beta_2$. 44. Distribution of Spearman’s rank 
correlation coefficient, $r_s$, in random rankings. 45. Distribution of Kendall’s 
rank correlation coefficient, $t_k$, in random rankings. 46. Distribution of the 
concordance coefficient, W, in random rankings. 47. Orthogonal polynomials. 

VI. Auxiliary Tables. 48. Powers of integers. 49. Sums of powers of integers. 
50. Squares of integers. 51. Factorials of integers, their logarithms; 
square roots; and their reciprocals. 52. Miscellaneous functions of p and 
$q = 1 - p$ over the unit range. 53. Natural logarithms, $\log_e x$. 54. Useful constants. 

W.A.W. 

Tables of $10^x$. National Bureau of Standards Applied Mathematics Series 27. 
Buckram bound. Pp. 543. $3.50 (Order from Government Printing Office, Washington 
25, D. C.). 

The Bureau’s announcement of this table reads: “Although there are a 
number of handy tables of logarithms to 10 or more places, these tables 
necessitate the use of inverse interpolation for finding the antilogarithm. 
Thus, a table of antilogarithms is needed. The present volume gives antilogarithms 
to the base 10, or $10^x$, in the form of two tables, a readily interpolable 
table for 10-decimal accuracy and a basic radix table for 15-figure 
accuracy. When used in conjunction with logarithmic tables in any extensive 
computations involving logarithms and antilogarithms, the Tables of $10^x$ will 
save considerably more labor than will logarithmic tables used alone. 

“The only similar table is J. Dodson’s Antilogarithmic Canon, published 
over 200 years ago. Besides being extremely scarce, Dodson’s table is hard 
to read, the tabular entries are arranged inconveniently, and there are a 
comparatively large number of errors in it. Tables of $10^x$ contains Table I, 
$10^x$ for x = 0(.00001)1.00000 to 10D, and Table II, a radix table of $10^{n \cdot 10^{-p}}$, 
n=1(1)999, p=3, 6, 9, 12, 15 to 15D. Although it has the same interval and 
number of places, Table I alone provides a great improvement over Dodson’s 
table in that it has all known errors corrected, its entries read vertically 
instead of horizontally, and it omits no digits from any entry. The ease of 
performing linear interpolation by machine eliminates the need here for 
differences and proportional parts. The fine interval of $10^{-5}$ in the argument 
permits determination of the full 10-decimal places by linear interpolation 
alone with a small 10th place correction that can be done mentally.” 

M.A.L. 
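
The announcement's remark about a "small 10th place correction" can be checked against the standard error bound for linear interpolation, $h^2 f''/8$, with $f(x) = 10^x$ and interval $h = 10^{-5}$. The sketch below is an illustration under stated assumptions: it recomputes the tabular entries in double precision instead of reading them from the published table, and the function names are hypothetical.

```python
import math

H = 1e-5  # tabular interval of Table I

def interpolate_pow10(x):
    """Linear interpolation between neighbouring entries of a 10**x table with step H."""
    k = math.floor(x / H)
    x0 = k * H
    x1 = (k + 1) * H
    f0, f1 = 10.0 ** x0, 10.0 ** x1  # stand-ins for the printed entries
    return f0 + (f1 - f0) * (x - x0) / H

def max_interpolation_error():
    """Bound H^2/8 * max f'' on [0, 1], where f''(x) = (ln 10)^2 * 10**x."""
    return H ** 2 / 8.0 * math.log(10.0) ** 2 * 10.0

# Worst case: midway between the last two entries, near x = 1.
x_mid = 0.999995
error = interpolate_pow10(x_mid) - 10.0 ** x_mid
bound = max_interpolation_error()
```

The bound is a few units in the tenth decimal place near x = 1 and about a tenth of that near x = 0, which is consistent with a correction small enough to apply mentally.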


Selected Papers in Statistics and Probability. Abraham Wald. Edited for the 
Institute of Mathematical Statistics by T. W. Anderson (Chairman), H. Cramér, 
H. A. Freeman, J. L. Hodges, Jr., E. L. Lehmann, A. M. Mood, and C. M. 
Stein. New York: McGraw-Hill Book Company, Inc., 1955. Pp. ix, 702. $8.00. 

This useful and important memorial to Abraham Wald opens with a one-page 
biography and an eighteen-page survey of Wald’s work, unsigned 
but presumably written by the Editors. Wald’s complete bibliography is 
then given, numbering 104 items (two-thirds of which appeared during the 
last twelve years before his death, at the age of 48, and the following year). 
There then follow 51 papers covering “most of Wald’s research in statistics 
and probability except for the work included in the books under his author- 
ship,” Sequential Analysis and Statistical Decision Functions. Of the papers, 
one, “Testing the Difference Between the Means of Two Normal Populations 
with Unknown Standard Deviations,” has not been published previously. 
One of the papers is in German, 35 are reprinted from the Annals of Mathe- 
matical Statistics, and 20 have co-authors. 
The format and printing, which is by a photographic process, are excellent 
and the price is commendably low for so large a volume. 
W.A.W. 